NetBeans Architecture Answers for Lexer module

Author: mmetelka@netbeans.org
Answers as of: Apr 17, 2024
Answers for questions version: 1.29
Latest available version of questions: 1.29

Interfaces table

Group of java interfaces

Interface Name	In/Out	Stability	Specified in What Document?
LexerAPI	Exported	Official
org.netbeans.modules.editor.util	Imported	Private	The module is needed for compilation. The module is used during runtime. Specification version 1.30 is required.
UtilitiesAPI	Imported	Official	The module is needed for compilation. The module is used during runtime. Specification version 9.3 is required.
WeakListener.setAccessible	Imported	Under Development	The module is needed for compilation. The module is used during runtime. Specification version 9.3 is required.
LookupAPI	Imported	Official	The module is needed for compilation. The module is used during runtime. Specification version 8.0 is required.

Group of logger interfaces

Interface Name	In/Out	Stability	Specified in What Document?
org.netbeans.lib.lexer.TokenHierarchyOperation	Exported	Friend	FINE level lists lexer changes made in tokens both at the root level and embedded levels of the token hierarchy after each document modification. FINER level in addition will also check the whole token hierarchy for internal consistency after each modification.
org.netbeans.lib.lexer.TokenList	Exported	Friend	FINE level forces lexer to perform more thorough and strict checks in certain situations so this is useful mainly for tests. Lookahead and state information is generated even for batch-lexed inputs which allows easier checking of incremental algorithm correctness (fixing of token list after modification). There are also some additional checks performed that should verify correctness of the framework and the SPI implementation classes being used (for example when flyweight tokens are created the text passed to the token factory is compared to the text in the lexer input).
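
The levels listed above can be set with plain java.util.logging calls. The following is a minimal sketch, assuming the two names in the table are used verbatim as logger names; the class name LexerLoggingSketch and its method are hypothetical and not part of the lexer module.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class LexerLoggingSketch {

    // Strong references are kept on purpose: java.util.logging holds loggers
    // weakly, so a level set on an unreferenced logger could be lost after GC.
    static final Logger HIERARCHY_LOG =
            Logger.getLogger("org.netbeans.lib.lexer.TokenHierarchyOperation");
    static final Logger TOKEN_LIST_LOG =
            Logger.getLogger("org.netbeans.lib.lexer.TokenList");

    /** Requests change listing plus consistency checks, and the stricter TokenList checks. */
    static void enableLexerDiagnostics() {
        HIERARCHY_LOG.setLevel(Level.FINER); // FINER also admits FINE records
        TOKEN_LIST_LOG.setLevel(Level.FINE);
    }

    public static void main(String[] args) {
        enableLexerDiagnostics();
        System.out.println(HIERARCHY_LOG.getName() + " -> " + HIERARCHY_LOG.getLevel());
        System.out.println(TOKEN_LIST_LOG.getName() + " -> " + TOKEN_LIST_LOG.getLevel());
    }
}
```

Setting the logger level only controls which records are produced. To actually see them, a handler whose own level admits FINE records must also be attached.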

General Information

Question (arch-what): What is this project good for?

Answer:

Question (arch-overall): Describe the overall architecture.

Answer:

LexerAPI

API entry point

TokenHierarchy

Input Sources

TokenHierarchy can be created for immutable input sources (CharSequence or java.io.Reader) or for mutable input sources (typically javax.swing.text.Document).
For a mutable input source the lexer framework updates the tokens in the token hierarchy automatically after each change to the underlying text input. The tokens of the hierarchy always reflect the text of the input at the given time.

TokenSequence and Token

TokenHierarchy.tokenSequence() allows iterating over a list of Token instances.
The token carries a token identification TokenId (returned by Token.id()) and a text (aka token body) represented as CharSequence (returned by Token.text()).
TokenUtilities contains many useful methods related to operations with the token's text such as TokenUtilities.equals(CharSequence text, Object o), TokenUtilities.startsWith(CharSequence text, CharSequence prefix), etc.
It is also possible to debug the text of the token (replace special chars by escapes) by TokenUtilities.debugText(CharSequence text).
A typical token also carries the offset of its occurrence in the input text.

Flyweight Tokens

Many token occurrences share the same text (e.g. Java keywords, operators or a single-space whitespace). Memory consumption can be decreased considerably by creating flyweight token instances, i.e. just one token instance is used for all occurrences of that token in all the inputs.
A flyweight token can be recognized by Token.isFlyweight().
The flyweight tokens do not carry a valid offset (their internal offset is -1).
Therefore TokenSequence is used for iteration through the tokens (instead of a regular iterator) and it provides TokenSequence.offset() which returns the proper offset even when positioned over a flyweight token.
When holding a reference to the token's instance its offset can also be determined by Token.offset(TokenHierarchy tokenHierarchy). The tokenHierarchy parameter should always be null; it is reserved for token hierarchy snapshot support in future releases.
For flyweight tokens Token.offset(TokenHierarchy tokenHierarchy) returns -1 and for regular tokens it returns the same value as TokenSequence.offset().

There may be applications where the use of flyweight tokens could be problematic. For example, if a parser wants to use token instances in parse tree nodes to determine the nodes' boundaries, the flyweight tokens would always return offset -1, so the positions of the parse tree nodes could not generally be determined from the tokens alone.
Therefore it is possible to de-flyweight a token by using TokenSequence.offsetToken(), which checks the current token and, if it is flyweight, replaces it with a non-flyweight token instance with a valid offset and with the same properties as the original flyweight token.
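
The relationship between shared instances, the sequence-level offset and de-flyweighting can be modelled in a few lines. This is only an illustrative sketch with hypothetical classes (SketchToken, SketchTokenSequence); it does not use or reproduce the lexer module's implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a token: text plus an offset, -1 when shared. */
final class SketchToken {
    final String text;
    final int offset; // -1 marks a flyweight (shared) instance

    SketchToken(String text, int offset) {
        this.text = text;
        this.offset = offset;
    }

    boolean isFlyweight() {
        return offset == -1;
    }
}

/** Hypothetical sequence that tracks offsets itself, in the spirit of TokenSequence.offset(). */
final class SketchTokenSequence {
    private static final Map<String, SketchToken> FLYWEIGHTS = new HashMap<>();

    private final List<SketchToken> tokens = new ArrayList<>();
    private final List<Integer> offsets = new ArrayList<>();
    private int length;

    /** Adds a token; fixed-text tokens reuse one shared instance per text. */
    void add(String text, boolean fixedText) {
        SketchToken t = fixedText
                ? FLYWEIGHTS.computeIfAbsent(text, s -> new SketchToken(s, -1))
                : new SketchToken(text, length);
        tokens.add(t);
        offsets.add(length);
        length += text.length();
    }

    SketchToken token(int index) {
        return tokens.get(index);
    }

    /** Valid for every token, flyweight or not. */
    int offset(int index) {
        return offsets.get(index);
    }

    /** Replaces a flyweight at the given index by a regular token with a valid offset. */
    SketchToken offsetToken(int index) {
        SketchToken t = tokens.get(index);
        if (t.isFlyweight()) {
            t = new SketchToken(t.text, offsets.get(index));
            tokens.set(index, t);
        }
        return t;
    }

    public static void main(String[] args) {
        SketchTokenSequence ts = new SketchTokenSequence();
        ts.add("a", false);
        ts.add("+", true);
        ts.add("b", false);
        ts.add("+", true);
        System.out.println("shared instance: " + (ts.token(1) == ts.token(3)));
        System.out.println("sequence offset: " + ts.offset(3));
        System.out.println("de-flyweighted offset: " + ts.offsetToken(3).offset);
    }
}
```

In this model the sequence, not the token, owns the position information, which is why a shared token can appear at many offsets at once.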

TokenId and Language

A token is identified by its id, represented by the TokenId interface. Token ids for a language are typically implemented as Java enums (extensions of Enum), but this is not mandatory.
All token ids for the given language are described by Language.
Each token id may belong to one or more token categories, which make it easier to work with tokens of the same type (e.g. keywords or operators).
Each token id may define its primary category in TokenId.primaryCategory(), and LanguageHierarchy.createTokenCategories() may provide additional categories for the token ids of the given language.
Each language description has a mandatory mime-type specification, Language.mimeType().
Although the mime-type is not strictly lexical information, it brings many benefits because it allows the language to be accompanied by arbitrary settings (e.g. syntax coloring information).

LanguageHierarchy, Lexer, LexerInput and TokenFactory

SPI providers wishing to provide a Language first need to define its SPI counterpart LanguageHierarchy. It mainly needs to define token ids in LanguageHierarchy.createTokenIds() and a lexer in LanguageHierarchy.createLexer(LexerRestartInfo info), where the info object supplies the lexer input, the token factory, the restart state, the language path and the input attributes.
Lexer reads characters from LexerInput and breaks the text into tokens.
Tokens are produced by using methods of TokenFactory.
As per-token memory consumption is critical, Token does not have any counterpart in the SPI. However, the framework prevents instantiation of any token classes other than those contained in the lexer module's implementation.

Language Embedding

With language embedding the flat list of tokens becomes in fact a tree-like hierarchy represented by the TokenHierarchy class. Each token can potentially be broken into a sequence of embedded tokens.
The TokenSequence.embedded() method can be called to obtain the embedded tokens (when positioned on the branch token).
There are two ways of specifying what language is embedded in a token. The language can either be specified explicitly (hardcoded) in the LanguageHierarchy.embedding() method or there can be a LanguageProvider registered in the default Lookup, which will create a Language for the embedded language.
There is no limit on the depth of a language hierarchy and there can be as many embedded languages as needed.
In SPI the language embedding is represented by LanguageEmbedding.

Question (arch-usecases): Describe the main use cases of the new API. Who will use it under what circumstances? What kind of code would typically need to be written to use the module?

Answer:

API Usecases

Obtaining of token hierarchy for various inputs.

TokenHierarchy

    String text = "public void m() { }";
    TokenHierarchy hi = TokenHierarchy.create(text, JavaLanguage.description());

    document.readLock();
    try {
        TokenHierarchy hi = TokenHierarchy.get(document);
        ... // explore tokens etc.
    } finally {
        document.readUnlock();
    }

Obtaining and iterating a token sequence over a particular Swing document from the given offset.

    document.readLock();
    try {
        TokenHierarchy hi = TokenHierarchy.get(document);
        TokenSequence ts = hi.tokenSequence();
        // If necessary move ts to the requested offset
        ts.move(offset);
        while (ts.moveNext()) {
            Token t = ts.token();
            if (t.id() == ...) { ... }
            if (TokenUtilities.equals(t.text(), "mytext")) { ... }
            if (ts.offset() == ...) { ... }

            // Possibly retrieve embedded token sequence
            TokenSequence embedded = ts.embedded();
            if (embedded != null) { // Token has a valid language embedding
                ...
            }
        }
    } finally {
        document.readUnlock();
    }

Editor's painting code doing syntax coloring (org.netbeans.modules.lexer.editorbridge.LexerLayer in the lexer/editorbridge module).
Brace matching code searching for matching brace in forward/backward direction.
Code completion's quick check whether caret is located inside comment token.
Parser constructing a parse tree iterating through the tokens in forward direction.

Using language path of the token sequence

    TokenSequence ts = ...
    LanguagePath lp = ts.languagePath();
    if (lp.size() > 1) { ... } // This is embedded token sequence
    if (lp.topLanguage() == JavaLanguage.description()) { ... } // top-level language of the token hierarchy
    String mimePath = lp.mimePath();
    // someSettings and settingName are placeholders for a mime-path based settings lookup
    Object settingValue = someSettings.getSetting(mimePath, settingName);

Extra information about the input

InputAttributes

public class MyLexer implements Lexer<MyTokenId> {
    
    private final int version;
    
    ...
    
    public MyLexer(LexerInput input, TokenFactory<MyTokenId> tokenFactory, Object state,
    LanguagePath languagePath, InputAttributes inputAttributes) {
        ...
        
        Integer ver = (inputAttributes != null)
                ? (Integer)inputAttributes.getValue(languagePath, "version")
                : null;
        this.version = (ver != null) ? ver.intValue() : 1; // Use version 1 if not specified explicitly
    }
    
    public Token<MyTokenId> nextToken() {
        ...
        if (recognizedAssertKeyword) { // placeholder for the actual keyword check
            return (version >= 4) // "assert" recognized as keyword since version 4
                ? keyword(MyTokenId.ASSERT)
                : identifier();
        }
        ...
    }
    ...
}

    InputAttributes attrs = new InputAttributes();
    // The "true" means global value i.e. for any occurrence of the MyLanguage including embeddings
    attrs.setValue(MyLanguage.description(), "version", Integer.valueOf(3), true);
    TokenHierarchy hi = TokenHierarchy.create(text, false, SimpleLanguage.description(), null, attrs);
    ...

Filtering out unnecessary tokens

    Set<MyTokenId> skipIds = EnumSet.of(MyTokenId.COMMENT, MyTokenId.WHITESPACE);
    TokenHierarchy tokenHierarchy = TokenHierarchy.create(inputText, false,
        MyLanguage.description(), skipIds, null);
    ...

Parser constructing a parse tree. It is not interested in the comment and whitespace tokens so these tokens do not need to be constructed at all.
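
From the client's point of view, the effect of the skip set is that tokens with the listed ids never show up. The following is a minimal sketch of that effect only, using a hypothetical SketchTokenId enum in place of MyTokenId; it says nothing about how the framework applies the set internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Hypothetical token ids, standing in for MyTokenId. */
enum SketchTokenId { IDENTIFIER, OPERATOR, WHITESPACE, COMMENT }

final class SkipIdsSketch {

    /** Keeps only the ids outside skipIds; a null set means "skip nothing". */
    static List<SketchTokenId> retained(List<SketchTokenId> lexed, Set<SketchTokenId> skipIds) {
        List<SketchTokenId> result = new ArrayList<>();
        for (SketchTokenId id : lexed) {
            if (skipIds == null || !skipIds.contains(id)) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<SketchTokenId> skipIds = EnumSet.of(SketchTokenId.COMMENT, SketchTokenId.WHITESPACE);
        List<SketchTokenId> lexed = Arrays.asList(
                SketchTokenId.IDENTIFIER, SketchTokenId.WHITESPACE,
                SketchTokenId.OPERATOR, SketchTokenId.COMMENT,
                SketchTokenId.IDENTIFIER);
        System.out.println(retained(lexed, skipIds));
    }
}
```

An EnumSet is a natural fit for the skip set when the token ids are enums, since membership checks are cheap and the set reads like the declaration in the usecase above.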

SPI Usecases

Providing language description and lexer.

org.netbeans.lib.lexer.test.simple.SimpleTokenId

org.netbeans.modules.lexer.editorbridge.calc.lang.CalcTokenId

language()

public enum CalcTokenId implements TokenId {

    WHITESPACE(null, "whitespace"),
    SL_COMMENT(null, "comment"),
    ML_COMMENT(null, "comment"),
    E("e", "keyword"),
    PI("pi", "keyword"),
    IDENTIFIER(null, null),
    INT_LITERAL(null, "number"),
    FLOAT_LITERAL(null, "number"),
    PLUS("+", "operator"),
    MINUS("-", "operator"),
    STAR("*", "operator"),
    SLASH("/", "operator"),
    LPAREN("(", "separator"),
    RPAREN(")", "separator"),
    ERROR(null, "error"),
    ML_COMMENT_INCOMPLETE(null, "comment");


    private final String fixedText;

    private final String primaryCategory;

    private CalcTokenId(String fixedText, String primaryCategory) {
        this.fixedText = fixedText;
        this.primaryCategory = primaryCategory;
    }
    
    public String fixedText() {
        return fixedText;
    }

    public String primaryCategory() {
        return primaryCategory;
    }

    private static final Language<CalcTokenId> language = new LanguageHierarchy<CalcTokenId>() {
        @Override
        protected Collection<CalcTokenId> createTokenIds() {
            return EnumSet.allOf(CalcTokenId.class);
        }
        
        @Override
        protected Map<String,Collection<CalcTokenId>> createTokenCategories() {
            Map<String,Collection<CalcTokenId>> cats = new HashMap<String,Collection<CalcTokenId>>();

            // Incomplete literals 
            cats.put("incomplete", EnumSet.of(CalcTokenId.ML_COMMENT_INCOMPLETE));
            // Additional literals being a lexical error
            cats.put("error", EnumSet.of(CalcTokenId.ML_COMMENT_INCOMPLETE));
            
            return cats;
        }

        @Override
        protected Lexer<CalcTokenId> createLexer(LexerRestartInfo<CalcTokenId> info) {
            return new CalcLexer(info);
        }

        @Override
        protected String mimeType() {
            return "text/x-calc";
        }
        
    }.language();

    public static final Language<CalcTokenId> language() {
        return language;
    }

}

LanguageHierarchy

public final class CalcLexer implements Lexer<CalcTokenId> {

    private static final int EOF = LexerInput.EOF;

    private static final Map<String,CalcTokenId> keywords = new HashMap<String,CalcTokenId>();
    static {
        keywords.put(CalcTokenId.E.fixedText(), CalcTokenId.E);
        keywords.put(CalcTokenId.PI.fixedText(), CalcTokenId.PI);
    }
    
    private LexerInput input;
    
    private TokenFactory<CalcTokenId> tokenFactory;

    CalcLexer(LexerRestartInfo<CalcTokenId> info) {
        this.input = info.input();
        this.tokenFactory = info.tokenFactory();
        assert (info.state() == null); // passed argument always null
    }
    
    public Token<CalcTokenId> nextToken() {
        while (true) {
            int ch = input.read();
            switch (ch) {
                case '+':
                    return token(CalcTokenId.PLUS);

                case '-':
                    return token(CalcTokenId.MINUS);

                case '*':
                    return token(CalcTokenId.STAR);

                case '/':
                    switch (input.read()) {
                        case '/': // in single-line comment
                            while (true) {
                                switch (input.read()) {
                                    case '\r':
                                        input.consumeNewline(); // consume a following '\n', if any
                                        // fall through
                                    case '\n':
                                    case EOF:
                                        return token(CalcTokenId.SL_COMMENT);
                                }
                            }
                        case '*': // in multi-line comment
                            while (true) {
                                ch = input.read();
                                while (ch == '*') {
                                    ch = input.read();
                                    if (ch == '/')
                                        return token(CalcTokenId.ML_COMMENT);
                                    else if (ch == EOF)
                                        return token(CalcTokenId.ML_COMMENT_INCOMPLETE);
                                }
                                if (ch == EOF)
                                    return token(CalcTokenId.ML_COMMENT_INCOMPLETE);
                            }
                    }
                    input.backup(1);
                    return token(CalcTokenId.SLASH);

                case '(':
                    return token(CalcTokenId.LPAREN);

                case ')':
                    return token(CalcTokenId.RPAREN);

                case '0': case '1': case '2': case '3': case '4':
                case '5': case '6': case '7': case '8': case '9':
                case '.':
                    return finishIntOrFloatLiteral(ch);

                case EOF:
                    return null;

                default:
                    if (Character.isWhitespace((char)ch)) {
                        ch = input.read();
                        while (ch != EOF && Character.isWhitespace((char)ch)) {
                            ch = input.read();
                        }
                        input.backup(1);
                        return token(CalcTokenId.WHITESPACE);
                    }

                    if (Character.isLetter((char)ch)) { // identifier or keyword
                        while (true) {
                            if (ch == EOF || !Character.isLetter((char)ch)) {
                                input.backup(1); // backup the extra char (or EOF)
                                // Check for keywords
                                CalcTokenId id = keywords.get(input.readText());
                                if (id == null) {
                                    id = CalcTokenId.IDENTIFIER;
                                }
                                return token(id);
                            }
                            ch = input.read(); // read next char
                        }
                    }

                    return token(CalcTokenId.ERROR);
            }
        }
    }

    public Object state() {
        return null;
    }

    private Token<CalcTokenId> finishIntOrFloatLiteral(int ch) {
        boolean floatLiteral = false;
        boolean inExponent = false;
        while (true) {
            switch (ch) {
                case '.':
                    if (floatLiteral) {
                        return token(CalcTokenId.FLOAT_LITERAL);
                    } else {
                        floatLiteral = true;
                    }
                    break;
                case '0': case '1': case '2': case '3': case '4':
                case '5': case '6': case '7': case '8': case '9':
                    break;
                case 'e': case 'E': // exponent part
                    if (inExponent) {
                        return token(CalcTokenId.FLOAT_LITERAL);
                    } else {
                        floatLiteral = true;
                        inExponent = true;
                    }
                    break;
                default:
                    input.backup(1);
                    return token(floatLiteral ? CalcTokenId.FLOAT_LITERAL
                            : CalcTokenId.INT_LITERAL);
            }
            ch = input.read();
        }
    }
    
    private Token<CalcTokenId> token(CalcTokenId id) {
        return (id.fixedText() != null)
            ? tokenFactory.getFlyweightToken(id, id.fixedText())
            : tokenFactory.createToken(id);
    }

}

The classes containing token ids and the language description should be part of an API. The lexer should only be part of the implementation.
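
The read/backup discipline that CalcLexer relies on (read one character too many, then give it back, with EOF counting as a character) can be modelled without the lexer module. The sketch below is a simplified, hypothetical model (SketchLexerInput, ReadBackupDemo); it is not the real LexerInput and omits lookahead and state handling.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of the read/backup contract used by the lexer above. */
final class SketchLexerInput {
    static final int EOF = -1;

    private final CharSequence text;
    private int tokenStart; // offset where the current token begins
    private int readIndex;  // offset of the next character to read

    SketchLexerInput(CharSequence text) {
        this.text = text;
    }

    /** Returns the next character or EOF; EOF also advances so that backup(1) undoes it. */
    int read() {
        int ch = (readIndex < text.length()) ? text.charAt(readIndex) : EOF;
        readIndex++;
        return ch;
    }

    void backup(int count) {
        readIndex -= count;
    }

    /** Ends the current token and returns its text. */
    String finishToken() {
        readIndex = Math.min(readIndex, text.length()); // a consumed EOF has no text
        String tokenText = text.subSequence(tokenStart, readIndex).toString();
        tokenStart = readIndex;
        return tokenText;
    }
}

final class ReadBackupDemo {

    /** Splits the input into runs of letters and single non-letter characters. */
    static List<String> split(String s) {
        SketchLexerInput in = new SketchLexerInput(s);
        List<String> out = new ArrayList<>();
        while (true) {
            int ch = in.read();
            if (ch == SketchLexerInput.EOF) {
                return out;
            }
            if (Character.isLetter((char) ch)) {
                // Consume the run of letters, then give back the extra char (or EOF)
                do {
                    ch = in.read();
                } while (ch != SketchLexerInput.EOF && Character.isLetter((char) ch));
                in.backup(1);
            }
            out.add(in.finishToken());
        }
    }

    public static void main(String[] args) {
        System.out.println(split("pi+e"));
    }
}
```

The identifier branch mirrors the structure of the keyword/identifier branch in CalcLexer: the loop stops on the first non-letter, and backup(1) returns it to the input so that it starts the next token.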

Providing language embedding.

LanguageHierarchy.embedding()

org.netbeans.lib.lexer.test.simple.SimpleLanguage

Or it may be provided dynamically through the xml layer by using a file in "Editors/language-mime-type/languagesEmbeddingMap" folder named by the token-id's name containing target mime-type and initial and ending skip lengths:

    <folder name="Editors">
        <folder name="text">
            <folder name="x-outer-language">
                <folder name="languagesEmbeddingMap">
                    <file name="WORD"><![CDATA[text/x-inner-language,1,2]]>
                    </file>
                </folder>
            </folder>
        </folder>
    </folder>
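
The content of such a file ("text/x-inner-language,1,2" above) can be read as a mime-type followed by the initial and ending skip lengths. The sketch below parses that form and applies the skip lengths to a token's text. It is an illustration under the assumption that all three comma-separated parts are present; EmbeddingSpec is a hypothetical class and the module's own parser may accept other forms.

```java
/** Hypothetical holder for one languagesEmbeddingMap entry. */
final class EmbeddingSpec {
    final String mimeType;
    final int startSkipLength;
    final int endSkipLength;

    EmbeddingSpec(String mimeType, int startSkipLength, int endSkipLength) {
        this.mimeType = mimeType;
        this.startSkipLength = startSkipLength;
        this.endSkipLength = endSkipLength;
    }

    /** Parses "mime-type,startSkip,endSkip"; surrounding whitespace is ignored. */
    static EmbeddingSpec parse(String content) {
        String[] parts = content.trim().split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected mime-type,startSkip,endSkip: " + content);
        }
        return new EmbeddingSpec(
                parts[0].trim(),
                Integer.parseInt(parts[1].trim()),
                Integer.parseInt(parts[2].trim()));
    }

    /** Returns the part of the branch token's text that the embedded language would see. */
    String embeddedText(String tokenText) {
        return tokenText.substring(startSkipLength, tokenText.length() - endSkipLength);
    }

    public static void main(String[] args) {
        EmbeddingSpec spec = EmbeddingSpec.parse("text/x-inner-language,1,2");
        System.out.println(spec.mimeType + " sees: " + spec.embeddedText("<hello/>"));
    }
}
```

With skip lengths 1 and 2, a branch token with the text `<hello/>` would expose only `hello` to the embedded language in this model.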

Question (arch-time): What are the time estimates of the work?

Answer:

Dynamic language embedding binding through xml layer.
CharPreprocessor servicing and tests.
Token hierarchy for Reader.
TokenFactory.createBranchToken() impl.
Providing JavaCC and Antlr support.
Support for token positions (may add API).

Question (arch-quality): How will the quality of your code be tested and how are future regressions going to be prevented?

Answer:

org.netbeans.lib.lexer.test.TestRandomModify

Question (arch-where): Where one can find sources for your module?

Answer:

The sources for the module are in the NetBeans Mercurial repositories.

Project and platform dependencies

Question (dep-nb): What other NetBeans projects and modules does this one depend on?

Answer:

These modules are required in project.xml:

org.netbeans.modules.editor.util - The module is needed for compilation. The module is used during runtime. Specification version 1.30 is required.
UtilitiesAPI - The module is needed for compilation. The module is used during runtime. Specification version 9.3 is required.
WeakListener.setAccessible - The module is needed for compilation. The module is used during runtime. Specification version 9.3 is required.
LookupAPI - The module is needed for compilation. The module is used during runtime. Specification version 8.0 is required.

Question (dep-non-nb): What other projects outside NetBeans does this one depend on?

Answer:

Question (dep-platform): On which platforms does your module run? Does it run in the same way on each?

Answer:

Question (dep-jre): Which version of JRE do you need (1.2, 1.3, 1.4, etc.)?

Answer:

Question (dep-jrejdk): Do you require the JDK or is the JRE enough?

Answer:

Deployment

Question (deploy-jar): Do you deploy just module JAR file(s) or other files as well?

Answer:

Question (deploy-nbm): Can you deploy an NBM via the Update Center?

Answer:

Question (deploy-shared): Do you need to be installed in the shared location only, or in the user directory only, or can your module be installed anywhere?

Answer:

Question (deploy-packages): Are packages of your module made inaccessible by not declaring them public?

Answer:

Question (deploy-dependencies): What do other modules need to do to declare a dependency on this one, in addition to or instead of the normal module dependency declaration (e.g. tokens to require)?

Answer:

OpenIDE-Module-Module-Dependencies: org.netbeans.modules.lexer/2 > @SPECIFICATION-VERSION@

Compatibility with environment

Question (compat-i18n): Is your module correctly internationalized?

Answer:

Question (compat-standards): Does the module implement or define any standards? Is the implementation exact or does it deviate somehow?

Answer:

Question (compat-version): Can your module coexist with earlier and future versions of itself? Can you correctly read all old settings? Will future versions be able to read your current settings? Can you read or politely ignore settings stored by a future version?

Answer:

Question (compat-deprecation): How does the introduction of your project influence functionality provided by the previous version of the product?

Answer:

The current API completely replaces the original one therefore the major version of the module was increased from 1 to 2.
There are no plans to deprecate any part of the present API and it should be evolved in a compatible way.

Access to resources

Question (resources-file): Does your module use java.io.File directly?

Answer:

Question (resources-layer): Does your module provide its own layer? Does it create any files or folders in it? What is it trying to communicate by that and with which components?

Answer:

Question (resources-read): Does your module read any resources from layers? For what purpose?

Answer:

Question (resources-mask): Does your module mask/hide/override any resources provided by other modules in their layers?

Answer:

Question (resources-preferences): Does your module use preferences via the Preferences API? Does your module use NbPreferences or regular JDK Preferences? Does it read, write or both? Does it share preferences with other modules? If so, then why?

Answer:

No.

Lookup of components

Question (lookup-lookup): Does your module use org.openide.util.Lookup or any similar technology to find any components to communicate with? Which ones?

Answer:

Question (lookup-register): Do you register anything into lookup for other code to find?

Answer:

Question (lookup-remove): Do you remove entries of other modules from lookup?

Answer:

Execution Environment

Question (exec-property): Is execution of your code influenced by any environment or Java system (System.getProperty) property? On a similar note, is there something interesting that you pass to java.util.logging.Logger? Or do you observe what others log?

Answer:

org.netbeans.lib.lexer.TokenHierarchyOperation logger: FINE and FINER levels.
org.netbeans.lib.lexer.TokenList logger: FINE level.

Question (exec-component): Is execution of your code influenced by any (string) property of any of your components?

Answer:

Question (exec-ant-tasks): Do you define or register any ant tasks that others can use?

Answer:

Question (exec-classloader): Does your code create its own class loader(s)?

Answer:

Question (exec-reflection): Does your code use Java Reflection to execute other code?

Answer:

Question (exec-privateaccess): Are you aware of any other parts of the system calling some of your methods by reflection?

Answer:

Question (exec-process): Do you execute an external process from your module? How do you ensure that the result is the same on different platforms? Do you parse output? Do you depend on result code?

Answer:

Question (exec-introspection): Does your module use any kind of runtime type information (instanceof, work with java.lang.Class, etc.)?

Answer:

Question (exec-threading): What threading models, if any, does your module adhere to? How does the project behave with respect to threading?

Answer:

Question (security-policy): Does your functionality require modifications to the standard policy file?

Answer:

Question (security-grant): Does your code grant additional rights to some other code?

Answer:

Format of files and protocols

Question (format-types): Which protocols and file formats (if any) does your module read or write on disk, or transmit or receive over the network? Do you generate an ant build script? Can it be edited and modified?

Answer:

Question (format-dnd): Which protocols (if any) does your code understand during Drag & Drop?

Answer:

Question (format-clipboard): Which data flavors (if any) does your code read from or insert to the clipboard (access to the clipboard means calling methods on java.awt.datatransfer.Transferable)?

Answer:

Performance and Scalability

Question (perf-startup): Does your module run any code on startup?

Answer:

Question (perf-exit): Does your module run any code on exit?

Answer:

Question (perf-scale): Which external criteria influence the performance of your program (size of file in editor, number of files in menu, in source directory, etc.) and how well your code scales?

Answer:

Question (perf-limit): Are there any hard-coded or practical limits in the number or size of elements your code can handle?

Answer:

Question (perf-mem): How much memory does your component consume? Estimate with a relation to the number of windows, etc.

Answer:

DefaultToken: 24 bytes
StringToken: 32 bytes (but only used for flyweight tokens)
PrepToken: 32 bytes plus text storage size (but only used for tokens where character preprocessing was necessary)
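
As a rough worked example based only on the DefaultToken figure above: if flyweight occurrences are assumed to add no per-occurrence token instance, the instance cost is 24 bytes times the number of non-flyweight tokens. The sketch below encodes that arithmetic. It is a back-of-the-envelope estimate, not a measurement; it ignores the token list's own storage, the shared flyweight instances and PrepToken text storage, and TokenMemorySketch is a hypothetical name.

```java
final class TokenMemorySketch {
    static final long DEFAULT_TOKEN_BYTES = 24; // DefaultToken figure quoted above

    /** Estimated bytes spent on regular (non-flyweight) token instances. */
    static long regularTokenBytes(long tokenOccurrences, long flyweightOccurrences) {
        if (flyweightOccurrences < 0 || flyweightOccurrences > tokenOccurrences) {
            throw new IllegalArgumentException("flyweightOccurrences out of range");
        }
        return (tokenOccurrences - flyweightOccurrences) * DEFAULT_TOKEN_BYTES;
    }

    public static void main(String[] args) {
        // Hypothetical input: 100,000 tokens, 40,000 of them flyweight occurrences
        // (100,000 - 40,000) * 24 = 1,440,000 bytes
        System.out.println(regularTokenBytes(100_000, 40_000) + " bytes");
    }
}
```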

Question (perf-wakeup): Does any piece of your code wake up periodically and do something even when the system is otherwise idle (no user interaction)?

Answer:

Question (perf-progress): Does your module execute any long-running tasks?

Answer:

Question (perf-huge_dialogs): Does your module contain any dialogs or wizards with a large number of GUI controls such as combo boxes, lists, trees, or text areas?

Answer:

Question (perf-menus): Does your module use dynamically updated context menus, or context-sensitive actions with complicated and slow enablement logic?

Answer:

Question (perf-spi): How will the performance of the plugged-in code be enforced?

Answer: