| Package | Description |
|---|---|
| org.netbeans.api.lexer | The entry point into the Lexer API is the TokenHierarchy class, whose static methods provide an instance for a given input source. |
| org.netbeans.spi.lexer | The main abstract class in the Lexer SPI that must be implemented is LanguageHierarchy, which mainly defines the set of token ids and token categories for the new language, and its Lexer. |
The lexer module defines the Lexer API, which provides access to a sequence of tokens for various input sources. The API entry point is the TokenHierarchy class, whose static methods provide an instance for a given input source.
A TokenHierarchy can be created for immutable input sources (CharSequence or java.io.Reader) or for mutable input sources (typically javax.swing.text.Document). For a mutable input source the lexer framework updates the tokens in the token hierarchy automatically as the underlying text input changes, so the tokens of the hierarchy always reflect the text of the input at the given time.
TokenHierarchy.tokenSequence() allows iteration over a list of Token instances. Each token carries a token identification, TokenId (returned by Token.id()), and a text, also called the token body, represented as a CharSequence (returned by Token.text()).
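For instance, a minimal iteration sketch (MyLanguage.description() is a hypothetical language description, in the spirit of the examples below):

```java
TokenHierarchy<?> hi = TokenHierarchy.create("3 + 4", MyLanguage.description());
TokenSequence<?> ts = hi.tokenSequence();
while (ts.moveNext()) {
    Token<?> token = ts.token();
    // Inspect the token's id and text
    System.out.println(token.id() + ": '" + token.text() + "'");
}
```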
TokenUtilities contains many useful methods for operating on a token's text, such as TokenUtilities.equals(CharSequence text, Object o), TokenUtilities.startsWith(CharSequence text, CharSequence prefix), etc. A token's text can also be turned into a debugging form (with special characters replaced by escapes) by TokenUtilities.debugText(CharSequence text).
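For illustration, a small sketch of these utilities (the token variable is assumed to come from an iteration like the one above):

```java
// Content-based comparison avoids materializing a String
if (TokenUtilities.equals(token.text(), "mytext")) {
    // token text matches exactly
}
if (TokenUtilities.startsWith(token.text(), "my")) {
    // token text has the given prefix
}
// Render special characters (newlines, tabs) as escapes for logging
System.out.println(TokenUtilities.debugText(token.text()));
```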
A typical token also carries the offset of its occurrence in the input text. As there are many token occurrences whose text is identical for all or many occurrences (e.g. Java keywords, operators or a single-space whitespace), memory consumption can be decreased considerably by allowing the creation of flyweight token instances, i.e. just one token instance is used for all of the token's occurrences in all inputs. Whether a token is flyweight can be determined by Token.isFlyweight(). Flyweight tokens do not carry a valid offset (their internal offset is -1).
Therefore TokenSequence is used for iteration through the tokens (instead of a regular iterator); it provides TokenSequence.offset(), which returns the proper offset even when positioned over a flyweight token. When holding a reference to a token instance, its offset can also be determined by Token.offset(TokenHierarchy tokenHierarchy). The tokenHierarchy parameter should always be null; it is reserved for token hierarchy snapshot support in future releases. For flyweight tokens Token.offset(TokenHierarchy tokenHierarchy) returns -1; for regular tokens it gives the same value as TokenSequence.offset().
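A short sketch of the difference (assuming a TokenSequence ts obtained as in the examples below):

```java
while (ts.moveNext()) {
    Token<?> t = ts.token();
    int seqOffset = t.isFlyweight() ? ts.offset() : t.offset(null);
    // ts.offset() is always a valid input offset;
    // t.offset(null) is -1 for flyweight tokens, otherwise equal to ts.offset()
    assert ts.offset() == seqOffset;
}
```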
There may be applications where the use of flyweight tokens could be problematic. For example, if a parser wants to store token instances in parse tree nodes to determine the nodes' boundaries, flyweight tokens would always return offset -1, so the positions of the parse tree nodes could not generally be determined from the tokens alone. Therefore a token can be de-flyweighted by using TokenSequence.offsetToken(), which checks the current token and, if it is flyweight, replaces it with a non-flyweight token instance that has a valid offset and the same properties as the original flyweight token.
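A hedged sketch of how a parser might retain positioned tokens (attachToParseTree is a hypothetical parser-side callback, not part of the API):

```java
while (ts.moveNext()) {
    // Replace a possible flyweight token with an equivalent
    // non-flyweight copy carrying a valid offset.
    Token<?> positioned = ts.offsetToken();
    attachToParseTree(positioned); // hypothetical parser call
}
```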
A token is identified by its id, represented by the TokenId interface. Token ids for a language are typically implemented as Java enums (extensions of Enum), but this is not mandatory. All token ids for a given language are described by a Language. Each token id may belong to one or more token categories, which make it easier to operate on tokens of the same type (e.g. keywords or operators). Each token id may define its primary category through TokenId.primaryCategory(), and LanguageHierarchy.createTokenCategories() may provide additional categories for the token ids of the given language.
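For example, a category-based check might look like this (the category names follow the Calc example later on this page):

```java
String cat = token.id().primaryCategory();
if ("keyword".equals(cat) || "operator".equals(cat)) {
    // treat keywords and operators alike, e.g. for syntax coloring
}
```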
Each language description has a mandatory mime-type specification, Language.mimeType(). Although this is somewhat unrelated information, it brings many benefits: through the mime-type the language can be accompanied by an arbitrary sort of settings (e.g. syntax coloring information etc.).
SPI providers wishing to provide a Language first need to define its SPI counterpart, LanguageHierarchy. It mainly needs to define the token ids in LanguageHierarchy.createTokenIds() and the lexer in LanguageHierarchy.createLexer(LexerRestartInfo info). The Lexer reads characters from a LexerInput and breaks the text into tokens, which are produced by using the methods of TokenFactory. As per-token memory consumption is critical, Token has no counterpart in the SPI; the framework prevents instantiation of any token classes other than those contained in the lexer module's implementation.
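A common pattern inside Lexer.nextToken() is to produce flyweight tokens for ids with fixed text and regular tokens otherwise, as the Calc example below does:

```java
private Token<CalcTokenId> token(CalcTokenId id) {
    // Fixed-text ids (operators, keywords) can share one flyweight instance
    return (id.fixedText() != null)
            ? tokenFactory.getFlyweightToken(id, id.fixedText())
            : tokenFactory.createToken(id);
}
```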
With language embedding the flat list of tokens in fact becomes a tree-like hierarchy, represented by the TokenHierarchy class. Each token can potentially be broken into a sequence of embedded tokens. The TokenSequence.embedded() method can be called (when positioned on the branch token) to obtain the embedded tokens. There are two ways of specifying what language is embedded in a token: the language can either be specified explicitly (hardcoded) in the LanguageHierarchy.embedding() method, or a LanguageProvider registered in the default Lookup can create the Language for the embedded language. There is no limit on the depth of a language hierarchy, and there can be as many embedded languages as needed. In the SPI, a language embedding is represented by LanguageEmbedding.
Notable API changes:

- Embeddings that request input sections to be joined before lexing are now lexed as a single section.
- Token.isRemoved() was added to check whether a particular token is still present in the token hierarchy or whether it was removed as part of a modification.
- Support for token hierarchy snapshots and generic character preprocessing was removed from the API and SPI, since there were no use cases yet and it should be possible to add the functionality later in a backward-compatible way. Some more changes regarding generification etc. were performed.
- LexerInput.integerState() was removed.
- TokenSequence.removeEmbedding() was added as a counterpart to TokenSequence.createEmbedding(). Also TokenSequence.isValid() was added to check whether the token sequence can still be used for iteration (i.e. there were no modifications of the underlying input in the meantime).
- Joining of embedded sections is now supported, and some minor additions were made, such as LanguagePath.parent(). Some is* methods with trivial implementations were removed from LanguagePath.
- TokenChange.embeddedChange(Language) was removed because there might be multiple such changes and they can be gathered with existing methods.
Example of tokenizing a character sequence:

```java
String text = "public void m() { }";
TokenHierarchy hi = TokenHierarchy.create(text, JavaLanguage.description());
```
Example of obtaining the token hierarchy of a document (the document must be read-locked):

```java
document.readLock();
try {
    TokenHierarchy hi = TokenHierarchy.get(document);
    ... // explore tokens etc.
} finally {
    document.readUnlock();
}
```
Example of iterating over the tokens of a document:

```java
document.readLock();
try {
    TokenHierarchy hi = TokenHierarchy.get(document);
    TokenSequence ts = hi.tokenSequence();
    // If necessary move ts to the requested offset
    ts.move(offset);
    while (ts.moveNext()) {
        Token t = ts.token();
        if (t.id() == ...) { ... }
        if (TokenUtilities.equals(t.text(), "mytext")) { ... }
        if (ts.offset() == ...) { ... }

        // Possibly retrieve embedded token sequence
        TokenSequence embedded = ts.embedded();
        if (embedded != null) { // Token has a valid language embedding
            ...
        }
    }
} finally {
    document.readUnlock();
}
```
For an example of client code see e.g. org.netbeans.modules.lexer.editorbridge.LexerLayer in the lexer/editorbridge module.
Example of working with LanguagePath (e.g. to retrieve mime-path-based settings):

```java
TokenSequence ts = ...
LanguagePath lp = ts.languagePath();
if (lp.size() > 1) { ... } // This is an embedded token sequence
if (lp.topLanguage() == JavaLanguage.description()) { ... } // top-level language of the token hierarchy
String mimePath = lp.mimePath();
Object settingValue = someSettings.getSetting(mimePath, settingName); // pseudo-code: look up a setting by mime-path
```
A lexer that recognizes multiple versions of a language can read the requested version from InputAttributes:

```java
public class MyLexer implements Lexer<MyTokenId> {

    private final int version;
    ...

    public MyLexer(LexerInput input, TokenFactory<MyTokenId> tokenFactory, Object state,
            LanguagePath languagePath, InputAttributes inputAttributes) {
        ...
        Integer ver = (inputAttributes != null)
                ? (Integer) inputAttributes.getValue(languagePath, "version")
                : null;
        this.version = (ver != null) ? ver.intValue() : 1; // Use version 1 if not specified explicitly
    }

    public Token<MyTokenId> nextToken() {
        ...
        if (recognized-assert-keyword) { // pseudo-code condition
            return (version >= 4) // "assert" recognized as keyword since version 4
                    ? keyword(MyTokenId.ASSERT)
                    : identifier();
        }
        ...
    }
    ...
}
```

The client will then use the following code:
```java
InputAttributes attrs = new InputAttributes();
// The "true" means a global value, i.e. for any occurrence of MyLanguage including embeddings
attrs.setValue(MyLanguage.description(), "version", Integer.valueOf(3), true);
TokenHierarchy hi = TokenHierarchy.create(text, false, MyLanguage.description(), null, attrs);
...
```
Tokens with certain ids can be skipped when the token hierarchy is created (useful e.g. for filtering out comments and whitespace):

```java
Set<MyTokenId> skipIds = EnumSet.of(MyTokenId.COMMENT, MyTokenId.WHITESPACE);
TokenHierarchy tokenHierarchy = TokenHierarchy.create(inputText, false,
        MyLanguage.description(), skipIds, null);
...
```
To define token ids, an existing definition such as org.netbeans.lib.lexer.test.simple.SimpleTokenId can be copied, or the following example, org.netbeans.modules.lexer.editorbridge.calc.lang.CalcTokenId, can be followed. Its language() method returns the language describing the token ids.
```java
public enum CalcTokenId implements TokenId {

    WHITESPACE(null, "whitespace"),
    SL_COMMENT(null, "comment"),
    ML_COMMENT(null, "comment"),
    E("e", "keyword"),
    PI("pi", "keyword"),
    IDENTIFIER(null, null),
    INT_LITERAL(null, "number"),
    FLOAT_LITERAL(null, "number"),
    PLUS("+", "operator"),
    MINUS("-", "operator"),
    STAR("*", "operator"),
    SLASH("/", "operator"),
    LPAREN("(", "separator"),
    RPAREN(")", "separator"),
    ERROR(null, "error"),
    ML_COMMENT_INCOMPLETE(null, "comment");

    private final String fixedText;

    private final String primaryCategory;

    private CalcTokenId(String fixedText, String primaryCategory) {
        this.fixedText = fixedText;
        this.primaryCategory = primaryCategory;
    }

    public String fixedText() {
        return fixedText;
    }

    public String primaryCategory() {
        return primaryCategory;
    }

    private static final Language<CalcTokenId> language = new LanguageHierarchy<CalcTokenId>() {

        @Override
        protected Collection<CalcTokenId> createTokenIds() {
            return EnumSet.allOf(CalcTokenId.class);
        }

        @Override
        protected Map<String,Collection<CalcTokenId>> createTokenCategories() {
            Map<String,Collection<CalcTokenId>> cats = new HashMap<String,Collection<CalcTokenId>>();
            // Incomplete literals
            cats.put("incomplete", EnumSet.of(CalcTokenId.ML_COMMENT_INCOMPLETE));
            // Additional literals being a lexical error
            cats.put("error", EnumSet.of(CalcTokenId.ML_COMMENT_INCOMPLETE));
            return cats;
        }

        @Override
        protected Lexer<CalcTokenId> createLexer(LexerRestartInfo<CalcTokenId> info) {
            return new CalcLexer(info);
        }

        @Override
        protected String mimeType() {
            return "text/x-calc";
        }

    }.language();

    public static final Language<CalcTokenId> language() {
        return language;
    }

}
```

Note that it is not necessary to publish the underlying LanguageHierarchy extension.
The corresponding CalcLexer implementation:

```java
public final class CalcLexer implements Lexer<CalcTokenId> {

    private static final int EOF = LexerInput.EOF;

    private static final Map<String,CalcTokenId> keywords = new HashMap<String,CalcTokenId>();
    static {
        keywords.put(CalcTokenId.E.fixedText(), CalcTokenId.E);
        keywords.put(CalcTokenId.PI.fixedText(), CalcTokenId.PI);
    }

    private LexerInput input;

    private TokenFactory<CalcTokenId> tokenFactory;

    CalcLexer(LexerRestartInfo<CalcTokenId> info) {
        this.input = info.input();
        this.tokenFactory = info.tokenFactory();
        assert (info.state() == null); // passed argument always null
    }

    public Token<CalcTokenId> nextToken() {
        while (true) {
            int ch = input.read();
            switch (ch) {
                case '+':
                    return token(CalcTokenId.PLUS);
                case '-':
                    return token(CalcTokenId.MINUS);
                case '*':
                    return token(CalcTokenId.STAR);
                case '/':
                    switch (input.read()) {
                        case '/': // in single-line comment
                            while (true)
                                switch (input.read()) {
                                    case '\r': input.consumeNewline(); // fall through
                                    case '\n':
                                    case EOF:
                                        return token(CalcTokenId.SL_COMMENT);
                                }
                        case '*': // in multi-line comment
                            while (true) {
                                ch = input.read();
                                while (ch == '*') {
                                    ch = input.read();
                                    if (ch == '/')
                                        return token(CalcTokenId.ML_COMMENT);
                                    else if (ch == EOF)
                                        return token(CalcTokenId.ML_COMMENT_INCOMPLETE);
                                }
                                if (ch == EOF)
                                    return token(CalcTokenId.ML_COMMENT_INCOMPLETE);
                            }
                    }
                    input.backup(1);
                    return token(CalcTokenId.SLASH);
                case '(':
                    return token(CalcTokenId.LPAREN);
                case ')':
                    return token(CalcTokenId.RPAREN);
                case '0': case '1': case '2': case '3': case '4':
                case '5': case '6': case '7': case '8': case '9':
                case '.':
                    return finishIntOrFloatLiteral(ch);
                case EOF:
                    return null;
                default:
                    if (Character.isWhitespace((char)ch)) {
                        ch = input.read();
                        while (ch != EOF && Character.isWhitespace((char)ch)) {
                            ch = input.read();
                        }
                        input.backup(1);
                        return token(CalcTokenId.WHITESPACE);
                    }
                    if (Character.isLetter((char)ch)) { // identifier or keyword
                        while (true) {
                            if (ch == EOF || !Character.isLetter((char)ch)) {
                                input.backup(1); // backup the extra char (or EOF)
                                // Check for keywords
                                CalcTokenId id = keywords.get(input.readText());
                                if (id == null) {
                                    id = CalcTokenId.IDENTIFIER;
                                }
                                return token(id);
                            }
                            ch = input.read(); // read next char
                        }
                    }
                    return token(CalcTokenId.ERROR);
            }
        }
    }

    public Object state() {
        return null;
    }

    private Token<CalcTokenId> finishIntOrFloatLiteral(int ch) {
        boolean floatLiteral = false;
        boolean inExponent = false;
        while (true) {
            switch (ch) {
                case '.':
                    if (floatLiteral) {
                        return token(CalcTokenId.FLOAT_LITERAL);
                    } else {
                        floatLiteral = true;
                    }
                    break;
                case '0': case '1': case '2': case '3': case '4':
                case '5': case '6': case '7': case '8': case '9':
                    break;
                case 'e': case 'E': // exponent part
                    if (inExponent) {
                        return token(CalcTokenId.FLOAT_LITERAL);
                    } else {
                        floatLiteral = true;
                        inExponent = true;
                    }
                    break;
                default:
                    input.backup(1);
                    return token(floatLiteral ? CalcTokenId.FLOAT_LITERAL : CalcTokenId.INT_LITERAL);
            }
            ch = input.read();
        }
    }

    private Token<CalcTokenId> token(CalcTokenId id) {
        return (id.fixedText() != null)
                ? tokenFactory.getFlyweightToken(id, id.fixedText())
                : tokenFactory.createToken(id);
    }

}
```
The classes containing token ids and the language description should be part of an API. The lexer should only be part of the implementation.
A language embedding can be defined statically by overriding LanguageHierarchy.embedding(); see e.g. org.netbeans.lib.lexer.test.simple.SimpleLanguage.
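A minimal sketch of such an override (the WORD token id and InnerLanguage are illustrative only, not part of the API):

```java
@Override
protected LanguageEmbedding<?> embedding(Token<MyTokenId> token,
        LanguagePath languagePath, InputAttributes inputAttributes) {
    if (token.id() == MyTokenId.WORD) { // hypothetical id carrying embedded text
        // Skip 1 leading and 2 trailing characters of the token text;
        // "false" means embedded sections are not joined before lexing.
        return LanguageEmbedding.create(InnerLanguage.description(), 1, 2, false);
    }
    return null; // no embedding for other tokens
}
```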
Alternatively, an embedding may be provided dynamically through the XML layer by using a file in the "Editors/language-mime-type/languagesEmbeddingMap" folder, named after the token-id's name and containing the target mime-type and the initial and ending skip lengths:
```xml
<folder name="Editors">
    <folder name="text">
        <folder name="x-outer-language">
            <folder name="languagesEmbeddingMap">
                <file name="WORD"><![CDATA[text/x-inner-language,1,2]]></file>
            </folder>
        </folder>
    </folder>
</folder>
```
The sources for the module are in the Apache Git repositories or in the GitHub repositories.
Module dependency declaration:

```
OpenIDE-Module-Module-Dependencies: org.netbeans.modules.lexer/2 > @SPECIFICATION-VERSION@
```
Read more about the implementation in the answers to architecture questions.