|
|
|
| |||||||||
![]() |
|
|
«
Previous Thread
|
Next Thread
»
|
Thread Tools | Search this Thread | Display Modes |
|
#1
|
|||
|
|||
|
Tokenizer theory and practice
I'm currently thinking about lexing/tokenizing binary data and Unicode,
in e.g. PDF files, and possible optimizations in LL/PEG parsers. Binary data typically comes in fixed length chunks, where the parser has to provide the length and encoding of the next token. This procedure is quite different from lexing text, that's why I prefer the term "tokenizer" in this context, and bottom-up (LR) parsers seem not to be very usable in this case. A lexer at best can skip over binary data, what often is sufficient, but then it should not be confused by embedded null "characters" in the input. Traditional lexers often use null characters as end markers, for either the end of the whole input or for the end of the input buffer, so that this value needs special handling anyways. Then it should not be hard to add another meaning to such bytes, as legal members of the expected character set. But how to implement that handling? IM the most time-efficient procedure would use an pointer into an input buffer, so that special handling (e.g. fill_buffer) is only required when a null byte is detected. A dedicated getter procedure instead can implement transparent buffering, returning an dedicated (epsilon/EF) value only at the real end of the input. The latter case seems to be more suitable with lexer generators, where null bytes could be substituted by some non-zero code, which would not require changes to an automatically generated lexer, apart from handling an extended character set of at least 257 members. ? Unicode introduces a couple of problems into lexers, which I don't want to discuss too deeply. Most important seems to be the expansion of the character codes, from single to multiple bytes. Together with various forms of encoding (UTF), a getter method seems to be preferable over using pointers. Then the getter can implement and encapsulate the encoding, so that the lexer only has to deal with uniform (4 byte) character codes. In formal languages a mix of ASCII (keywords) and "real" Unicode parts (literals) can be expected, so that an optimized tokenizer should be able to treat these parts differently. Again a top-down parser seems to be preferable, because it can switch the tokenizer into the different modes (SBCS, MBCS, binary). But wouldn't it make sense, then, to replace the lexer/tokenizer by multiple "matchers", which are called by the parser as appropriate, and which can e.g. match expected literals (required keywords) without guessing of the kind of the next token. But can such specialized functions really speed up or simplify parsing? Back to binary files, including PDF, ZIP and other file formats with structured content, a multi-level parser seems to be required in most cases, which can separate the file into distinct parts, which can (often must) be processed in random order. I vaguely remember an approach to formalize the description and decoding of binary files - can somebody give me more precise hints in that direction? DoDi |
|
#2
|
|||
|
|||
|
Tokenizer theory and practice
cr88192 schrieb:
>Back to binary files, including PDF, ZIP and other file formats with >structured content, a multi-level parser seems to be required in most >cases, which can separate the file into distinct parts, which can (often >must) be processed in random order. I vaguely remember an approach to >formalize the description and decoding of binary files - can somebody >give me more precise hints in that direction? > IM, parsers of this sort (especially text-intended parser generators), are not a good approach to parsing binary data, nor do I believe are the formalisms often employed, which are themselves intended for textual structures. That's why I mentioned multi-level parsers, where the top level handles the file structure. In PDF files the file structure is textual, with embedded binary streams. In other files the structure can be binary, with embedded text snippets. Rather, I suggest looking at how such formats are actually structured, and how they are commonly represented within format specs, and try to derive from this a general notation, formalism, and possible implementation. It would be nice to start with some already existing formalism. an issue I think: there are a wide variety of possible ways in which data can be, and often is, structured: uniform-sized blocks (common in filesystems, formats like GCF, B-Trees, ); structures and offsets (ELF, CFF, ); TLV structures (RIFF, PNG, ASN.1, ); bitstreams (Deflate, ); bytestreams (Java class, WBXML, many simpler LZ77 variants, ); escaped tags (JPEG, MPEG, FLAC, ZIP, ); > So, rather than having a single formalism (such as LL or LR), more likely, one will end up with a kind of mini programming language just to specify how the pieces fit together. I hope to separate the data structures, as syntactical elements, from attributes etc. as kind of semantic code. In the best case it should be possible to derive both an serializer and an de-serializer from a given formal description. presumed features: structs (explicit on-disk layout), conditionals, support for variable length fields and members, ability to deal with file offsets and similar ideas, > even then, there is likely to be a whole lot more that can't be formalized, than that can be. I don't expect that a single formalism can cover really all weird file formats. But it would be nice to have more than a description of the basic "tokens", which cannot be much more than numbers or strings - in a lot of different formats, of course. I don't want to decode bitstreams or encrypted data, that would be a task for an automaton in an sub-parser. more likely, one can end up more with what would amount to a format-design tool, than something that can actually reliably parse existing formats. > even then, it is usually not parsing the format that is the work, rather it is processing the data that is usually a lot more work. The task of the parser is finished, when a file is converted into a tree structure, possibly with additional tables for file offsets, tag names and more. for example (just pulling something out of nowhere): struct PACK_Entry { char name[56]; u32le offset; u32le size; byte@offset data[size]; }; > struct PACK_Header { FURCC magic="PACK"; u32le offset; u32le ents; PACK_Entry@offset dir[ents]; }; Such a description indeed will cover the most common data structures, of a fixed or counted length. We'll have to add (zero) terminated items (strings and other vectors), and multi-level descriptions for offsets (relative to some file section, table lookup). A "whitespace" equivalent will be required as well, for the description of alignment and padding. now, personally, I don't usually use such formalisms or tools, since personally I have usually found it to be most effective simply to write the parser myself, and if likely it were beyond my abilities to write a parser, it would also likely be somewhat beyond the abilities of these formalisms as well I've decoded many file formats myself, and my most useful tool was a hex viewer with formatting capabilities. It could display files as tables of fixed-size records, where the record formats were described by strings, with 'c' for a character, 'l' for a longword etc. The table positions and formats were written to an file, ready for later reuse. What I'm still missing is the next step, with more sophisticated types, and an ability to convert the fixed table offsets and counts into computed elements, as templates for viewing files of the same type. It may be more promising to extend that old tool with the capabilities, mentioned above, instead of formalizing all aspects of automated parsing of non-textual files. BTW, IDA is a fine example for such a tool DoDi |
|
#3
|
|||
|
|||
|
Tokenizer theory and practice
Dmitry A. Kazakov schrieb:
>Binary data typically comes in fixed length chunks, where the parser has >to provide the length and encoding of the next token. This procedure is >quite different from lexing text, that's why I prefer the term >"tokenizer" in this context, and bottom-up (LR) parsers seem not to be >very usable in this case. A lexer at best can skip over binary data, >what often is sufficient, but then it should not be confused by embedded >null "characters" in the input. > I don't really understand the problem, maybe, you can elaborate it. Why NUL is a problem and why tokens need to be "raw text." These are problems of many (C based) lexer generators. A conversion, into anything but an built-in data type, requires semantic code, which makes a formal grammar usable only with the language of that code. Few generators come with a built-in language for doing such conversions, like MetaS or TextTransformer. When I do similar stuff, I do it in a way that the parser returned typed objects rather than copies of the source. The whole idea to copy the source is bogus, IM Indeed, textual copies are of little use. Can you suggest a descriptive formalism for the objects, returned by an lexer? DoDi |
|
#4
|
|||
|
|||
|
Tokenizer theory and practice
cr88192 schrieb:
>I hope to separate the data structures, as syntactical elements, from >attributes etc. as kind of semantic code. In the best case it should be >possible to derive both an serializer and an de-serializer from a given >formal description. >> > ok. > so, one first describes all of the pieces, and then later how they are assembled? ok, but it may be difficult to pull off You are right. Looking at some popular binary file formats, the construction (writing) must be done sequentially, with possible patches of offsets in preceding tables. When an offset table is written last, it must be read in first, from the end of the file. I'm not sure whether a formal description of such procedures is possible, for use in both reading and writing such an file. An according grammar may become context sensitive, what classifies the problem as very interesting from the scientific point of view, but it's unlikely that it will result in a usable tool. as noted, I also said that there would be conditionals, that or we could also support BNF-based descriptions. > u64 uvli() { byte v; return((v&0x80)?(((v&0x7f)<<7)|uvli()):(v&0x7f)); }; > u64 svli() { uvli v; return((v&1)?(-((v+1)>>1)):(v>>1)); } > of course, this does not make it clear how to encode these types A common example would be UTF-8 encoding/decoding, which is easily described in a procedural way, and possibly also in a grammar, but deriving the code from such a grammar seems to exceed my capabilities. <sigh> your intent is for reverse engineering or something? that is what this sounds like to me at least That's the background, how I came to parsers at all ;-) usually for more complex or bulky formats, I write special tools After considering all the topics, mentioned in this thread, I better leave the theory to other people, too. As you stated before: >> more likely, one can end up more with what would amount to a format-design tool, than something that can actually reliably parse existing formats. << DoDi |
![]() |
| Viewing: Web Development Archives > FAQs > Compilers > Tokenizer theory and practice |
| Thread Tools | Search this Thread |
| Display Modes | Rate This Thread |
|
|
|
|