Compilers
 
Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
 
User Name:
Password:
Remember me
Go Back   Web Development Archives FAQs Compilers

Reply
Add This Thread To:
  Del.icio.us   Digg   Google   Spurl   Blink   Furl   Simpy   Y! MyWeb 
Thread Tools Search this Thread Display Modes
 
Unread Web Development Archives Sponsor:
  #1  
Old May 14th, 2008, 12:01 PM
Hans-Peter Diettrich
Guest
Dev Archives Newbie (0 - 499 posts)
 
Posts: n/a  
Time spent in forums:
Reputation Power:
Tokenizer theory and practice

I'm currently thinking about lexing/tokenizing binary data and Unicode,
in e.g. PDF files, and possible optimizations in LL/PEG parsers.

Binary data typically comes in fixed length chunks, where the parser has
to provide the length and encoding of the next token. This procedure is
quite different from lexing text, that's why I prefer the term
"tokenizer" in this context, and bottom-up (LR) parsers seem not to be
very usable in this case. A lexer at best can skip over binary data,
what often is sufficient, but then it should not be confused by embedded
null "characters" in the input. Traditional lexers often use null
characters as end markers, for either the end of the whole input or for
the end of the input buffer, so that this value needs special handling
anyways. Then it should not be hard to add another meaning to such
bytes, as legal members of the expected character set. But how to
implement that handling?

IM the most time-efficient procedure would use an pointer into an input
buffer, so that special handling (e.g. fill_buffer) is only required
when a null byte is detected. A dedicated getter procedure instead can
implement transparent buffering, returning an dedicated (epsilon/EF)
value only at the real end of the input. The latter case seems to be
more suitable with lexer generators, where null bytes could be
substituted by some non-zero code, which would not require changes to an
automatically generated lexer, apart from handling an extended character
set of at least 257 members. ?


Unicode introduces a couple of problems into lexers, which I don't want
to discuss too deeply. Most important seems to be the expansion of the
character codes, from single to multiple bytes. Together with various
forms of encoding (UTF), a getter method seems to be preferable over
using pointers. Then the getter can implement and encapsulate the
encoding, so that the lexer only has to deal with uniform (4 byte)
character codes.

In formal languages a mix of ASCII (keywords) and "real" Unicode parts
(literals) can be expected, so that an optimized tokenizer should be
able to treat these parts differently. Again a top-down parser seems to
be preferable, because it can switch the tokenizer into the different
modes (SBCS, MBCS, binary). But wouldn't it make sense, then, to
replace the lexer/tokenizer by multiple "matchers", which are called by
the parser as appropriate, and which can e.g. match expected literals
(required keywords) without guessing of the kind of the next token.
But can such specialized functions really speed up or simplify parsing?


Back to binary files, including PDF, ZIP and other file formats with
structured content, a multi-level parser seems to be required in most
cases, which can separate the file into distinct parts, which can (often
must) be processed in random order. I vaguely remember an approach to
formalize the description and decoding of binary files - can somebody
give me more precise hints in that direction?

DoDi

Reply With Quote
  #2  
Old May 17th, 2008, 12:20 AM
Hans-Peter Diettrich
Guest
Dev Archives Newbie (0 - 499 posts)
 
Posts: n/a  
Time spent in forums:
Reputation Power:
Tokenizer theory and practice

cr88192 schrieb:

>Back to binary files, including PDF, ZIP and other file formats with
>structured content, a multi-level parser seems to be required in most
>cases, which can separate the file into distinct parts, which can (often
>must) be processed in random order. I vaguely remember an approach to
>formalize the description and decoding of binary files - can somebody
>give me more precise hints in that direction?
>

IM, parsers of this sort (especially text-intended parser
generators), are not a good approach to parsing binary data, nor do I
believe are the formalisms often employed, which are themselves
intended for textual structures.

That's why I mentioned multi-level parsers, where the top level handles
the file structure. In PDF files the file structure is textual, with
embedded binary streams. In other files the structure can be binary,
with embedded text snippets.


Rather, I suggest looking at how such formats are actually structured,
and how they are commonly represented within format specs, and try to
derive from this a general notation, formalism, and possible
implementation.

It would be nice to start with some already existing formalism.

an issue I think:
there are a wide variety of possible ways in which data can be, and often
is, structured:
uniform-sized blocks (common in filesystems, formats like GCF, B-Trees,
);
structures and offsets (ELF, CFF, );
TLV structures (RIFF, PNG, ASN.1, );
bitstreams (Deflate, );
bytestreams (Java class, WBXML, many simpler LZ77 variants, );
escaped tags (JPEG, MPEG, FLAC, ZIP, );

>

So, rather than having a single formalism (such as LL or LR), more
likely, one will end up with a kind of mini programming language just
to specify how the pieces fit together.

I hope to separate the data structures, as syntactical elements, from
attributes etc. as kind of semantic code. In the best case it should be
possible to derive both an serializer and an de-serializer from a given
formal description.


presumed features:
structs (explicit on-disk layout), conditionals, support for variable length
fields and members, ability to deal with file offsets and similar ideas,
>

even then, there is likely to be a whole lot more that can't be formalized,
than that can be.

I don't expect that a single formalism can cover really all weird file
formats. But it would be nice to have more than a description of the
basic "tokens", which cannot be much more than numbers or strings - in a
lot of different formats, of course. I don't want to decode bitstreams
or encrypted data, that would be a task for an automaton in an sub-parser.


more likely, one can end up more with what would amount to a format-design
tool, than something that can actually reliably parse existing formats.
>

even then, it is usually not parsing the format that is the work, rather it
is processing the data that is usually a lot more work.

The task of the parser is finished, when a file is converted into a tree
structure, possibly with additional tables for file offsets, tag names
and more.

for example (just pulling something out of nowhere):
struct PACK_Entry {
char name[56];
u32le offset;
u32le size;
byte@offset data[size];
};
>

struct PACK_Header {
FURCC magic="PACK";
u32le offset;
u32le ents;
PACK_Entry@offset dir[ents];
};

Such a description indeed will cover the most common data structures, of
a fixed or counted length. We'll have to add (zero) terminated items
(strings and other vectors), and multi-level descriptions for offsets
(relative to some file section, table lookup). A "whitespace"
equivalent will be required as well, for the description of alignment
and padding.


now, personally, I don't usually use such formalisms or tools, since
personally I have usually found it to be most effective simply to
write the parser myself, and if likely it were beyond my abilities to
write a parser, it would also likely be somewhat beyond the abilities
of these formalisms as well

I've decoded many file formats myself, and my most useful tool was a
hex viewer with formatting capabilities. It could display files as
tables of fixed-size records, where the record formats were described
by strings, with 'c' for a character, 'l' for a longword etc. The
table positions and formats were written to an file, ready for later
reuse. What I'm still missing is the next step, with more
sophisticated types, and an ability to convert the fixed table offsets
and counts into computed elements, as templates for viewing files of
the same type. It may be more promising to extend that old tool with
the capabilities, mentioned above, instead of formalizing all aspects
of automated parsing of non-textual files. BTW, IDA is a fine example
for such a tool

DoDi

Reply With Quote
  #3  
Old May 17th, 2008, 05:01 PM
Hans-Peter Diettrich
Guest
Dev Archives Newbie (0 - 499 posts)
 
Posts: n/a  
Time spent in forums:
Reputation Power:
Tokenizer theory and practice

Dmitry A. Kazakov schrieb:

>Binary data typically comes in fixed length chunks, where the parser has
>to provide the length and encoding of the next token. This procedure is
>quite different from lexing text, that's why I prefer the term
>"tokenizer" in this context, and bottom-up (LR) parsers seem not to be
>very usable in this case. A lexer at best can skip over binary data,
>what often is sufficient, but then it should not be confused by embedded
>null "characters" in the input.
>

I don't really understand the problem, maybe, you can elaborate
it. Why NUL is a problem and why tokens need to be "raw text."

These are problems of many (C based) lexer generators. A conversion,
into anything but an built-in data type, requires semantic code, which
makes a formal grammar usable only with the language of that code. Few
generators come with a built-in language for doing such conversions,
like MetaS or TextTransformer.

When I do similar stuff, I do it in a way that the parser returned
typed objects rather than copies of the source. The whole idea to
copy the source is bogus, IM

Indeed, textual copies are of little use. Can you suggest a
descriptive formalism for the objects, returned by an lexer?

DoDi

Reply With Quote
  #4  
Old May 19th, 2008, 08:50 PM
Hans-Peter Diettrich
Guest
Dev Archives Newbie (0 - 499 posts)
 
Posts: n/a  
Time spent in forums:
Reputation Power:
Tokenizer theory and practice

cr88192 schrieb:

>I hope to separate the data structures, as syntactical elements, from
>attributes etc. as kind of semantic code. In the best case it should be
>possible to derive both an serializer and an de-serializer from a given
>formal description.
>>

>

ok.
>

so, one first describes all of the pieces, and then later how they are
assembled?
ok, but it may be difficult to pull off

You are right. Looking at some popular binary file formats, the
construction (writing) must be done sequentially, with possible
patches of offsets in preceding tables. When an offset table is
written last, it must be read in first, from the end of the file. I'm
not sure whether a formal description of such procedures is possible,
for use in both reading and writing such an file. An according grammar
may become context sensitive, what classifies the problem as very
interesting from the scientific point of view, but it's unlikely that
it will result in a usable tool.


as noted, I also said that there would be conditionals, that or we could
also support BNF-based descriptions.
>

u64 uvli() {
byte v;
return((v&0x80)?(((v&0x7f)<<7)|uvli()):(v&0x7f));
};
>

u64 svli() {
uvli v;
return((v&1)?(-((v+1)>>1)):(v>>1));
}
>

of course, this does not make it clear how to encode these types

A common example would be UTF-8 encoding/decoding, which is easily
described in a procedural way, and possibly also in a grammar, but
deriving the code from such a grammar seems to exceed my capabilities.
<sigh>


your intent is for reverse engineering or something?
that is what this sounds like to me at least

That's the background, how I came to parsers at all ;-)

usually for more complex or bulky formats, I write special tools

After considering all the topics, mentioned in this thread, I better
leave the theory to other people, too. As you stated before:
>>

more likely, one can end up more with what would amount to a
format-design tool, than something that can actually reliably parse
existing formats.
<<

DoDi

Reply With Quote
Reply

Viewing: Web Development Archives FAQs Compilers > Tokenizer theory and practice


Thread Tools  Search this Thread 
Search this Thread:

Advanced Search
Display Modes  Rate This Thread 
Rate This Thread:


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are Off
[IMG] code is On
HTML code is Off
View Your Warnings | New Posts | Latest Threads | Shoutbox
Forum Jump


Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
  
 





© 2003-2008 by Developer Shed. All rights reserved. DS Cluster 5 hosted by Hostway
Stay green...Green IT