org.netbeans.modules.lexer/2 1.68.0

Lexer
Official

The Lexer module defines the Lexer API, which provides access to a sequence of tokens for various input sources.


Packages

org.netbeans.api.lexer
The entry point into the Lexer API is the TokenHierarchy class, whose static methods provide an instance for a given input source.
org.netbeans.spi.lexer
The main abstract class of the Lexer SPI that must be implemented is LanguageHierarchy, which primarily defines the set of token ids and token categories for the new language and creates its Lexer.

The entry point into the Lexer API is the TokenHierarchy class, whose static methods provide an instance for the given input source.

Input Sources

A TokenHierarchy can be created for immutable input sources (CharSequence or java.io.Reader) or for mutable input sources (typically javax.swing.text.Document).
For a mutable input source, the lexer framework updates the tokens in the token hierarchy automatically as the underlying text input changes, so the tokens of the hierarchy always reflect the text of the input at the given time.
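
For illustration, a token hierarchy can be obtained for both kinds of sources (a sketch; MyLanguage is a hypothetical language description):
    // Immutable input: a standalone token hierarchy over a CharSequence
    TokenHierarchy hi = TokenHierarchy.create("a + b", MyLanguage.description());

    // Mutable input: the hierarchy maintained for a swing document;
    // the framework keeps its tokens in sync with document edits
    TokenHierarchy docHi = TokenHierarchy.get(document);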

TokenSequence and Token

TokenHierarchy.tokenSequence() returns a TokenSequence that allows iteration over a list of Token instances.
A token carries a token identification TokenId (returned by Token.id()) and a text (also called the token body) represented as a CharSequence (returned by Token.text()).
TokenUtilities contains many useful methods for working with a token's text, such as TokenUtilities.equals(CharSequence text, Object o), TokenUtilities.startsWith(CharSequence text, CharSequence prefix), etc.
It is also possible to obtain a debug form of the token's text (with special characters replaced by escapes) through TokenUtilities.debugText(CharSequence text).
A typical token also carries the offset of its occurrence in the input text.
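
For example, a sketch of text-related checks on a token (assuming ts is a TokenSequence positioned on a token):
    Token t = ts.token();
    if (TokenUtilities.equals(t.text(), "while")) {
        // the token's text matches "while" without creating a String
    }
    if (TokenUtilities.startsWith(t.text(), "//")) {
        // e.g. a line comment
    }
    // Print a debug form of the text ("\n" instead of a real newline)
    System.out.println(TokenUtilities.debugText(t.text()));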

Flyweight Tokens

Since many token occurrences have identical text (e.g. Java keywords, operators or a single-space whitespace), memory consumption can be decreased considerably by allowing the creation of flyweight token instances, i.e. just one token instance is used for all the token's occurrences in all inputs.
Whether a token is flyweight can be determined by Token.isFlyweight().
Flyweight tokens do not carry a valid offset (their internal offset is -1).
Therefore TokenSequence is used for iteration through the tokens (instead of a regular iterator); it provides TokenSequence.offset(), which returns the proper offset even when positioned over a flyweight token.
When holding a reference to a token instance, its offset can also be determined by Token.offset(TokenHierarchy tokenHierarchy). The tokenHierarchy parameter should always be null; it will be used for token hierarchy snapshot support in future releases.
For flyweight tokens Token.offset(TokenHierarchy tokenHierarchy) returns -1; for regular tokens it gives the same value as TokenSequence.offset().

There may be applications where the use of flyweight tokens could be problematic. For example, if a parser wants to store token instances in parse tree nodes to determine the nodes' boundaries, flyweight tokens would always return offset -1, so the positions of the parse tree nodes could not generally be determined from the tokens alone.
Therefore it is possible to de-flyweight a token by using TokenSequence.offsetToken(), which checks the current token and, if it is flyweight, replaces it with a non-flyweight token instance that has a valid offset and the same properties as the original flyweight token.
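
A sketch of how a parser might retain tokens with valid offsets (the parse-tree node and its addToken() method are hypothetical):
    while (ts.moveNext()) {
        // Replaces a flyweight token at the current position with an
        // offset-carrying copy; regular tokens are returned unchanged
        Token t = ts.offsetToken();
        node.addToken(t); // t.offset(null) now matches ts.offset()
    }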

TokenId and Language

A token is identified by its id, represented by the TokenId interface. Token ids for a language are typically implemented as Java enums (extensions of Enum), but this is not mandatory.
All token ids for a given language are described by a Language.
Each token id may belong to one or more token categories, which make it easier to operate on tokens of the same type (e.g. keywords or operators).
Each token id may define its primary category via TokenId.primaryCategory(), and LanguageHierarchy.createTokenCategories() may provide additional categories for the token ids of the given language.
Each language description has a mandatory mime-type specification, Language.mimeType().
Although somewhat unrelated information, the mime-type brings many benefits because it allows the language to be accompanied by an arbitrary set of settings (e.g. syntax coloring information, etc.).
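
For instance, the categories of a language can be inspected as follows (a sketch using the CalcTokenId example shown later in this document):
    Language<CalcTokenId> language = CalcTokenId.language();
    // All category names (primary and additional)
    Set<String> categories = language.tokenCategories();
    // All token ids belonging to the "operator" category
    Set<CalcTokenId> operators = language.tokenCategoryMembers("operator");
    // Mime-type under which settings such as coloring are registered
    String mimeType = language.mimeType(); // "text/x-calc"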

LanguageHierarchy, Lexer, LexerInput and TokenFactory

SPI providers wishing to provide a Language first need to define its SPI counterpart, LanguageHierarchy. It mainly needs to define the token ids in LanguageHierarchy.createTokenIds() and the lexer in LanguageHierarchy.createLexer(LexerRestartInfo info); the LexerRestartInfo gives the lexer access to the LexerInput, the TokenFactory, the restart state, the LanguagePath and the InputAttributes.
The Lexer reads characters from the LexerInput and breaks the text into tokens.
Tokens are produced by using the methods of TokenFactory.
As per-token memory consumption is critical, Token does not have any counterpart in the SPI. The framework prevents instantiation of any token classes other than those contained in the lexer module's implementation.

Language Embedding

With language embedding the flat list of tokens becomes in fact a tree-like hierarchy represented by the TokenHierarchy class. Each token can potentially be broken into a sequence of embedded tokens.
The TokenSequence.embedded() method can be called to obtain the embedded tokens (when positioned on the branch token).
There are two ways of specifying what language is embedded in a token. The language can either be specified explicitly (hardcoded) in the LanguageHierarchy.embedding() method, or a LanguageProvider registered in the default Lookup can create a Language for the embedded language.
There is no limit on the depth of a language hierarchy and there can be as many embedded languages as needed.
In the SPI, language embedding is represented by LanguageEmbedding.
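
A sketch of a static embedding specification in a LanguageHierarchy subclass (the token id and the inner language are illustrative):
    @Override
    protected LanguageEmbedding<?> embedding(Token<MyTokenId> token,
    LanguagePath languagePath, InputAttributes inputAttributes) {
        if (token.id() == MyTokenId.STRING_LITERAL) {
            // Embed a hypothetical inner language in string literal tokens,
            // skipping the quote character at each end of the token's text;
            // the last argument (joinSections) keeps the sections separate
            return LanguageEmbedding.create(InnerLanguage.description(), 1, 1, false);
        }
        return null; // no embedding for other tokens
    }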


Use Cases

API Use Cases

Obtaining a token hierarchy for various inputs.

TokenHierarchy is the entry point into the Lexer API; it represents the given input in terms of tokens.
    String text = "public void m() { }";
    TokenHierarchy hi = TokenHierarchy.create(text, JavaLanguage.description());

A token hierarchy over a swing document must be operated under the document's read or write lock.
    document.readLock();
    try {
        TokenHierarchy hi = TokenHierarchy.get(document);
        ... // explore tokens etc.
    } finally {
        document.readUnlock();
    }

Obtaining and iterating a token sequence over a particular swing document from a given offset.

The tokens cover the whole document and it is possible to iterate either forward or backward.
Each token can contain a language embedding that can also be explored via the token sequence. The language embedding covers the whole text of the token (a few characters may be skipped at the beginning and end of the branch token).
    document.readLock();
    try {
        TokenHierarchy hi = TokenHierarchy.get(document);
        TokenSequence ts = hi.tokenSequence();
        // If necessary move ts to the requested offset
        ts.move(offset);
        while (ts.moveNext()) {
            Token t = ts.token();
            if (t.id() == ...) { ... }
            if (TokenUtilities.equals(t.text(), "mytext")) { ... }
            if (ts.offset() == ...) { ... }

            // Possibly retrieve embedded token sequence
            TokenSequence embedded = ts.embedded();
            if (embedded != null) { // Token has a valid language embedding
                ...
            }
        }
    } finally {
        document.readUnlock();
    }

Typical clients:
  • Editor's painting code doing syntax coloring, e.g. org.netbeans.modules.lexer.editorbridge.LexerLayer in the lexer/editorbridge module.
  • Brace matching code searching for matching brace in forward/backward direction.
  • Code completion's quick check whether caret is located inside comment token.
  • Parser constructing a parse tree iterating through the tokens in forward direction.

Using language path of the token sequence

For the given token sequence the client may check whether it is the top-level token sequence in the token hierarchy or whether it is embedded, at which level it is embedded, and what the parent languages are.
    TokenSequence ts = ...
    LanguagePath lp = ts.languagePath();
    if (lp.size() > 1) { ... } // This is embedded token sequence
    if (lp.topLanguage() == JavaLanguage.description()) { ... } // top-level language of the token hierarchy
    String mimePath = lp.mimePath();
    Object settingValue = settings.getSetting(mimePath, settingName); // pseudo-code: settings lookup keyed by mime path

Extra information about the input

The InputAttributes class may carry extra information about the text input on which the token hierarchy is being created. For example, there can be information about the version of the language that the input represents, and the lexer may be written to recognize multiple versions of the language. It should suffice to do the versioning through a simple integer:
public class MyLexer implements Lexer<MyTokenId> {
    
    private final int version;
    
    ...
    
    public MyLexer(LexerInput input, TokenFactory<MyTokenId> tokenFactory, Object state,
    LanguagePath languagePath, InputAttributes inputAttributes) {
        ...
        
        Integer ver = (inputAttributes != null)
                ? (Integer)inputAttributes.getValue(languagePath, "version")
                : null;
        this.version = (ver != null) ? ver.intValue() : 1; // Use version 1 if not specified explicitly
    }
    
    public Token<MyTokenId> nextToken() {
        ...
        if (recognized-assert-keyword) {
            // "assert" recognized as a keyword since version 4
            return (version >= 4)
                    ? keyword(MyTokenId.ASSERT)
                    : identifier();
        }
        ...
    }
    ...
}
The client will then use the following code:
    InputAttributes attrs = new InputAttributes();
    // The "true" means global value i.e. for any occurrence of the MyLanguage including embeddings
    attrs.setValue(MyLanguage.description(), "version", Integer.valueOf(3), true);
    TokenHierarchy hi = TokenHierarchy.create(text, false, MyLanguage.description(), null, attrs);
    ...

Filtering out unnecessary tokens

Filtering is only possible for immutable inputs (e.g. String or Reader).
    Set<MyTokenId> skipIds = EnumSet.of(MyTokenId.COMMENT, MyTokenId.WHITESPACE);
    TokenHierarchy tokenHierarchy = TokenHierarchy.create(inputText, false,
        MyLanguage.description(), skipIds, null);
    ...

Typical clients:
  • Parser constructing a parse tree. It is not interested in the comment and whitespace tokens so these tokens do not need to be constructed at all.

SPI Use Cases

Providing a language description and lexer.

Token ids should be defined as enums. For example, org.netbeans.lib.lexer.test.simple.SimpleTokenId can be copied, or see the following example of org.netbeans.modules.lexer.editorbridge.calc.lang.CalcTokenId.
The static language() method returns the Language describing the token ids.
public enum CalcTokenId implements TokenId {

    WHITESPACE(null, "whitespace"),
    SL_COMMENT(null, "comment"),
    ML_COMMENT(null, "comment"),
    E("e", "keyword"),
    PI("pi", "keyword"),
    IDENTIFIER(null, null),
    INT_LITERAL(null, "number"),
    FLOAT_LITERAL(null, "number"),
    PLUS("+", "operator"),
    MINUS("-", "operator"),
    STAR("*", "operator"),
    SLASH("/", "operator"),
    LPAREN("(", "separator"),
    RPAREN(")", "separator"),
    ERROR(null, "error"),
    ML_COMMENT_INCOMPLETE(null, "comment");


    private final String fixedText;

    private final String primaryCategory;

    private CalcTokenId(String fixedText, String primaryCategory) {
        this.fixedText = fixedText;
        this.primaryCategory = primaryCategory;
    }
    
    public String fixedText() {
        return fixedText;
    }

    public String primaryCategory() {
        return primaryCategory;
    }

    private static final Language<CalcTokenId> language = new LanguageHierarchy<CalcTokenId>() {
        @Override
        protected Collection<CalcTokenId> createTokenIds() {
            return EnumSet.allOf(CalcTokenId.class);
        }
        
        @Override
        protected Map<String,Collection<CalcTokenId>> createTokenCategories() {
            Map<String,Collection<CalcTokenId>> cats = new HashMap<String,Collection<CalcTokenId>>();

            // Incomplete literals 
            cats.put("incomplete", EnumSet.of(CalcTokenId.ML_COMMENT_INCOMPLETE));
            // Additional literals being a lexical error
            cats.put("error", EnumSet.of(CalcTokenId.ML_COMMENT_INCOMPLETE));
            
            return cats;
        }

        @Override
        protected Lexer<CalcTokenId> createLexer(LexerRestartInfo<CalcTokenId> info) {
            return new CalcLexer(info);
        }

        @Override
        protected String mimeType() {
            return "text/x-calc";
        }
        
    }.language();

    public static final Language<CalcTokenId> language() {
        return language;
    }

}
Note that the underlying LanguageHierarchy extension does not need to be published.
Lexer example:
public final class CalcLexer implements Lexer<CalcTokenId> {

    private static final int EOF = LexerInput.EOF;

    private static final Map<String,CalcTokenId> keywords = new HashMap<String,CalcTokenId>();
    static {
        keywords.put(CalcTokenId.E.fixedText(), CalcTokenId.E);
        keywords.put(CalcTokenId.PI.fixedText(), CalcTokenId.PI);
    }
    
    private LexerInput input;
    
    private TokenFactory<CalcTokenId> tokenFactory;

    CalcLexer(LexerRestartInfo<CalcTokenId> info) {
        this.input = info.input();
        this.tokenFactory = info.tokenFactory();
        assert (info.state() == null); // passed argument always null
    }
    
    public Token<CalcTokenId> nextToken() {
        while (true) {
            int ch = input.read();
            switch (ch) {
                case '+':
                    return token(CalcTokenId.PLUS);

                case '-':
                    return token(CalcTokenId.MINUS);

                case '*':
                    return token(CalcTokenId.STAR);

                case '/':
                    switch (input.read()) {
                        case '/': // in single-line comment
                            while (true)
                                switch (input.read()) {
                                    case '\r': input.consumeNewline();
                                    case '\n':
                                    case EOF:
                                        return token(CalcTokenId.SL_COMMENT);
                                }
                        case '*': // in multi-line comment
                            while (true) {
                                ch = input.read();
                                while (ch == '*') {
                                    ch = input.read();
                                    if (ch == '/')
                                        return token(CalcTokenId.ML_COMMENT);
                                    else if (ch == EOF)
                                        return token(CalcTokenId.ML_COMMENT_INCOMPLETE);
                                }
                                if (ch == EOF)
                                    return token(CalcTokenId.ML_COMMENT_INCOMPLETE);
                            }
                    }
                    input.backup(1);
                    return token(CalcTokenId.SLASH);

                case '(':
                    return token(CalcTokenId.LPAREN);

                case ')':
                    return token(CalcTokenId.RPAREN);

                case '0': case '1': case '2': case '3': case '4':
                case '5': case '6': case '7': case '8': case '9':
                case '.':
                    return finishIntOrFloatLiteral(ch);

                case EOF:
                    return null;

                default:
                    if (Character.isWhitespace((char)ch)) {
                        ch = input.read();
                        while (ch != EOF && Character.isWhitespace((char)ch)) {
                            ch = input.read();
                        }
                        input.backup(1);
                        return token(CalcTokenId.WHITESPACE);
                    }

                    if (Character.isLetter((char)ch)) { // identifier or keyword
                        while (true) {
                            if (ch == EOF || !Character.isLetter((char)ch)) {
                                input.backup(1); // backup the extra char (or EOF)
                                // Check for keywords
                                // toString() so the lookup matches the String keys
                                CalcTokenId id = keywords.get(input.readText().toString());
                                if (id == null) {
                                    id = CalcTokenId.IDENTIFIER;
                                }
                                return token(id);
                            }
                            ch = input.read(); // read next char
                        }
                    }

                    return token(CalcTokenId.ERROR);
            }
        }
    }

    public Object state() {
        return null;
    }

    private Token<CalcTokenId> finishIntOrFloatLiteral(int ch) {
        boolean floatLiteral = false;
        boolean inExponent = false;
        while (true) {
            switch (ch) {
                case '.':
                    if (floatLiteral) {
                        return token(CalcTokenId.FLOAT_LITERAL);
                    } else {
                        floatLiteral = true;
                    }
                    break;
                case '0': case '1': case '2': case '3': case '4':
                case '5': case '6': case '7': case '8': case '9':
                    break;
                case 'e': case 'E': // exponent part
                    if (inExponent) {
                        return token(CalcTokenId.FLOAT_LITERAL);
                    } else {
                        floatLiteral = true;
                        inExponent = true;
                    }
                    break;
                default:
                    input.backup(1);
                    return token(floatLiteral ? CalcTokenId.FLOAT_LITERAL
                            : CalcTokenId.INT_LITERAL);
            }
            ch = input.read();
        }
    }
    
    private Token<CalcTokenId> token(CalcTokenId id) {
        return (id.fixedText() != null)
            ? tokenFactory.getFlyweightToken(id, id.fixedText())
            : tokenFactory.createToken(id);
    }

}

The classes containing the token ids and the language description should be part of the API; the lexer should only be part of the implementation.

Providing a language embedding.

The embedding may be provided statically in LanguageHierarchy.embedding(); see e.g. org.netbeans.lib.lexer.test.simple.SimpleLanguage.

Or it may be provided dynamically through the XML layer, using a file in the "Editors/language-mime-type/languagesEmbeddingMap" folder named after the token id and containing the target mime-type together with the initial and ending skip lengths:

    <folder name="Editors">
        <folder name="text">
            <folder name="x-outer-language">
                <folder name="languagesEmbeddingMap">
                    <file name="WORD"><![CDATA[text/x-inner-language,1,2]]>
                    </file>
                </folder>
            </folder>
        </folder>
    </folder>
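
Alternatively, as mentioned earlier, the embedded language may be resolved dynamically by a LanguageProvider registered in the default Lookup. A minimal sketch (the inner language and mime-type are illustrative):

    public final class MyLanguageProvider extends LanguageProvider {

        @Override
        public Language<?> findLanguage(String mimeType) {
            return "text/x-inner-language".equals(mimeType)
                    ? InnerLanguage.description() // hypothetical Language
                    : null;
        }

        @Override
        public LanguageEmbedding<?> findLanguageEmbedding(Token<?> token,
        LanguagePath languagePath, InputAttributes inputAttributes) {
            return null; // this provider defines no dynamic embeddings
        }
    }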

Exported Interfaces

This table lists all of the module exported APIs with defined stability classifications. It is generated based on answers to questions about the architecture of the module. Read them all...
Group of java interfaces
  • LexerAPI — Exported, Official

Group of logger interfaces
  • org.netbeans.lib.lexer.TokenHierarchyOperation — Exported, Friend

FINE level lists lexer changes made in tokens both at the root level and embedded levels of the token hierarchy after each document modification.
FINER level in addition will also check the whole token hierarchy for internal consistency after each modification.

  • org.netbeans.lib.lexer.TokenList — Exported, Friend

FINE level forces lexer to perform more thorough and strict checks in certain situations so this is useful mainly for tests. Lookahead and state information is generated even for batch-lexed inputs which allows easier checking of incremental algorithm correctness (fixing of token list after modification). There are also some additional checks performed that should verify correctness of the framework and the SPI implementation classes being used (for example when flyweight tokens are created the text passed to the token factory is compared to the text in the lexer input).
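
Both are standard java.util.logging loggers named after the listed classes, so they can, for example, be enabled programmatically:
    Logger.getLogger("org.netbeans.lib.lexer.TokenHierarchyOperation").setLevel(Level.FINE);
    Logger.getLogger("org.netbeans.lib.lexer.TokenList").setLevel(Level.FINE);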

Implementation Details

Where are the sources for the module?

The sources for the module are in the NetBeans Mercurial repositories.

What do other modules need to do to declare a dependency on this one, in addition to or instead of a plain module dependency?
    OpenIDE-Module-Module-Dependencies: org.netbeans.modules.lexer/2 > @SPECIFICATION-VERSION@

Read more about the implementation in the answers to architecture questions.
