TokenHierarchy
can be created for immutable input sources (
CharSequence
or
java.io.Reader
) or for mutable input sources (typically
javax.swing.text.Document
).
For mutable input sources the lexer framework updates the tokens in the token hierarchy automatically
as the underlying text input changes.
The tokens of the hierarchy always reflect the text of the input at the given time.
TokenHierarchy.tokenSequence()
allows iteration over a list of
Token
instances.
The token carries a token identification
TokenId
(returned by
Token.id()
) and a text (aka token body) represented as
CharSequence
(returned by
Token.text()
).
TokenUtilities
contains useful methods for operating on a token's text, such as
TokenUtilities.equals(CharSequence text, Object o),
TokenUtilities.startsWith(CharSequence text, CharSequence prefix),
etc.
It is also possible to obtain a debug form of the token's text (with special characters replaced by escapes) by
TokenUtilities.debugText(CharSequence text).
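Because Token.text() returns a CharSequence rather than a String, such comparisons can be done without allocating new strings. The following self-contained sketch (an illustrative re-implementation, not the actual TokenUtilities code) shows the kind of character-by-character comparison these methods perform:

```java
// Illustrative re-implementation of CharSequence comparisons similar in spirit
// to TokenUtilities.equals and TokenUtilities.startsWith.
// This is a sketch, not the NetBeans implementation.
public class CharSequenceCompare {

    // true if the two sequences contain the same characters
    public static boolean contentEquals(CharSequence a, CharSequence b) {
        if (a.length() != b.length()) {
            return false;
        }
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    // true if text begins with the given prefix
    public static boolean startsWith(CharSequence text, CharSequence prefix) {
        if (prefix.length() > text.length()) {
            return false;
        }
        for (int i = 0; i < prefix.length(); i++) {
            if (text.charAt(i) != prefix.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A token's text may be any CharSequence, e.g. a StringBuilder
        CharSequence tokenText = new StringBuilder("public");
        System.out.println(contentEquals(tokenText, "public")); // true
        System.out.println(startsWith(tokenText, "pub"));       // true
    }
}
```

The point of the CharSequence-based signatures is that no String copy of the token text has to be created just for a comparison.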
A typical token also carries the offset of its occurrence in the input text.
Since many tokens (e.g. Java keywords, operators or a single-space whitespace)
have the same text in all or many of their occurrences, memory consumption
can be decreased considerably by allowing the creation of flyweight token instances,
i.e. just one token instance is used for all of the token's occurrences
in all inputs.
Flyweight tokens can be determined by
Token.isFlyweight().
The flyweight tokens do not carry a valid offset (their internal offset is -1).
Therefore
TokenSequence
is used for iteration through the tokens (instead of a regular iterator) and it provides
TokenSequence.offset()
which returns the proper offset even when positioned over a flyweight token.
When holding a reference to a token instance, its offset can also be determined by
Token.offset(TokenHierarchy tokenHierarchy).
The tokenHierarchy
parameter should always be null; it is reserved
for token hierarchy snapshot support in future releases.
For flyweight tokens
Token.offset(TokenHierarchy tokenHierarchy)
returns -1; for regular tokens it gives the same value as
TokenSequence.offset().
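The memory effect of flyweight tokens, and why their offsets must come from the enclosing sequence, can be illustrated by this self-contained sketch of the flyweight pattern (hypothetical classes, not the actual lexer implementation):

```java
import java.util.HashMap;
import java.util.Map;

// A simplified sketch of the flyweight idea used by the lexer framework:
// one shared token instance per fixed text, with offsets tracked by the
// iterating sequence rather than by the token itself.
// All names here are hypothetical.
public class FlyweightSketch {

    public static final class Tok {
        public final String text;
        public final boolean flyweight;

        public Tok(String text, boolean flyweight) {
            this.text = text;
            this.flyweight = flyweight;
        }

        // Flyweight tokens carry no valid offset of their own
        public int offset() {
            return flyweight ? -1 : 0;
        }
    }

    private static final Map<String, Tok> cache = new HashMap<>();

    // Return the single shared instance for the given fixed text
    public static Tok flyweight(String text) {
        return cache.computeIfAbsent(text, t -> new Tok(t, true));
    }

    public static void main(String[] args) {
        // Two occurrences of the keyword "public" share one instance
        Tok a = flyweight("public");
        Tok b = flyweight("public");
        System.out.println(a == b);      // true: same shared instance
        System.out.println(a.offset());  // -1: the real offset must come from the sequence
    }
}
```

Because one instance stands in for every occurrence, it cannot remember any particular position; this is exactly why TokenSequence.offset() exists.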
There may be applications where the use of flyweight tokens could be problematic.
For example, if a parser wants to store token instances
in parse tree nodes to determine the nodes' boundaries, flyweight tokens
would always return offset -1, so the positions of the parse tree nodes
could not generally be determined from the tokens alone.
Therefore it is possible to de-flyweight a token by using
TokenSequence.offsetToken(),
which checks the current token
and, if it is flyweight, replaces it with a non-flyweight token instance
with a valid offset and the same properties as the original flyweight token.
A token is identified by its id, represented by the
TokenId
interface. Token ids for a language are typically implemented as Java enums (extensions of
Enum
), but this is not mandatory.
All token ids for the given language are described by
Language.
Each token id may belong
to one or more token categories that allow better handling of
tokens of the same type (e.g. keywords or operators).
Each token id may define its primary category
TokenId.primaryCategory()
and
LanguageHierarchy.createTokenCategories()
may provide additional categories for the token ids for the given language.
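The relationship between a primary category and additional categories can be sketched with a minimal, self-contained stand-in (the enum, interface-free shape, and method names below are illustrative, not the actual API):

```java
import java.util.Collection;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;

// A hypothetical sketch of token ids carrying a primary category, plus extra
// categories supplied separately, mirroring the roles of
// TokenId.primaryCategory() and LanguageHierarchy.createTokenCategories().
public class TokenCategoriesSketch {

    public enum SimpleId {
        IF("keyword"), ELSE("keyword"), PLUS("operator"), WHITESPACE("whitespace");

        private final String primaryCategory;

        SimpleId(String primaryCategory) {
            this.primaryCategory = primaryCategory;
        }

        // Each id names its own primary category
        public String primaryCategory() {
            return primaryCategory;
        }
    }

    // Additional categories beyond the primary one, keyed by category name
    public static Map<String, Collection<SimpleId>> createTokenCategories() {
        Map<String, Collection<SimpleId>> cats = new HashMap<>();
        // e.g. tokens that tools may want to skip over
        cats.put("skip", EnumSet.of(SimpleId.WHITESPACE));
        return cats;
    }

    public static void main(String[] args) {
        System.out.println(SimpleId.IF.primaryCategory());       // keyword
        System.out.println(createTokenCategories().get("skip")); // [WHITESPACE]
    }
}
```

A single id can thus appear in several categories at once, which lets clients (e.g. syntax coloring) address whole groups of tokens uniformly.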
Each language description has a mandatory mime-type specification
Language.mimeType().
Although this information is somewhat unrelated, it brings many benefits
because with the mime-type the language can be accompanied
by an arbitrary sort of settings (e.g. syntax coloring information etc.).
SPI providers wishing to provide a
Language
first need to define its SPI counterpart
LanguageHierarchy.
It mainly needs to define token ids in
LanguageHierarchy.createTokenIds()
and lexer in
LanguageHierarchy.createLexer(LexerInput lexerInput, TokenFactory tokenFactory, Object state, LanguagePath languagePath, InputAttributes inputAttributes).
Lexer
reads characters from
LexerInput
and breaks the text into tokens.
Tokens are produced by using methods of
TokenFactory.
As per-token memory consumption is critical,
Token
does not have any counterpart in the SPI. The framework also prevents instantiation
of any token classes other than those contained in the lexer module's implementation.
With language embedding the flat list of tokens becomes in fact a tree-like hierarchy
represented by the
TokenHierarchy
class. Each token can potentially be broken into a sequence of embedded tokens.
The
TokenSequence.embedded()
method can be called to obtain the embedded tokens (when positioned on the branch token).
There are two ways of specifying what language is embedded in a token. The language
can either be specified explicitly (hardcoded) in the
LanguageHierarchy.embedding()
method or there can be a
LanguageProvider
registered in the default Lookup, which will create a
Language
for the embedded language.
There is no limit on the depth of a language hierarchy and there can be as many embedded languages
as needed.
In SPI the language embedding is represented by
LanguageEmbedding.
String text = "public void m() { }";
TokenHierarchy hi = TokenHierarchy.create(text, JavaLanguage.description());
document.readLock();
try {
    TokenHierarchy hi = TokenHierarchy.get(document);
    ... // explore tokens etc.
} finally {
    document.readUnlock();
}
document.readLock();
try {
    TokenHierarchy hi = TokenHierarchy.get(document);
    TokenSequence ts = hi.tokenSequence();
    // If necessary move ts to the requested offset
    ts.move(offset);
    while (ts.moveNext()) {
        Token t = ts.token();
        if (t.id() == ...) { ... }
        if (TokenUtilities.equals(t.text(), "mytext")) { ... }
        if (ts.offset() == ...) { ... }
        // Possibly retrieve embedded token sequence
        TokenSequence embedded = ts.embedded();
        if (embedded != null) { // Token has a valid language embedding
            ...
        }
    }
} finally {
    document.readUnlock();
}
See e.g.
org.netbeans.modules.lexer.editorbridge.LexerLayer
in the lexer/editorbridge module.
TokenSequence ts = ...
LanguagePath lp = ts.languagePath();
if (lp.size() > 1) { ... } // This is an embedded token sequence
if (lp.topLanguage() == JavaLanguage.description()) { ... } // top-level language of the token hierarchy
String mimePath = lp.mimePath();
Object settingValue = someSettings.getSetting(mimePath, settingName);
public class MyLexer implements Lexer&lt;MyTokenId&gt; {

    private final int version;
    ...

    public MyLexer(LexerInput input, TokenFactory&lt;MyTokenId&gt; tokenFactory,
            Object state, LanguagePath languagePath, InputAttributes inputAttributes) {
        ...
        Integer ver = (inputAttributes != null)
                ? (Integer) inputAttributes.getValue(languagePath, "version")
                : null;
        this.version = (ver != null) ? ver.intValue() : 1; // Use version 1 if not specified explicitly
    }

    public Token&lt;MyTokenId&gt; nextToken() {
        ...
        if (recognized-assert-keyword) {
            // "assert" recognized as a keyword since version 4
            return (version >= 4) ? keyword(MyTokenId.ASSERT) : identifier();
        }
        ...
    }
    ...
}

The client will then use the following code:
InputAttributes attrs = new InputAttributes();
// The "true" means a global value i.e. for any occurrence of MyLanguage including embeddings
attrs.setValue(MyLanguage.description(), "version", Integer.valueOf(3), true);
TokenHierarchy hi = TokenHierarchy.create(text, false, MyLanguage.description(), null, attrs);
...
Set&lt;MyTokenId&gt; skipIds = EnumSet.of(MyTokenId.COMMENT, MyTokenId.WHITESPACE);
TokenHierarchy tokenHierarchy = TokenHierarchy.create(inputText, false, MyLanguage.description(), skipIds, null);
...
As a starting point, the
org.netbeans.lib.lexer.test.simple.SimpleTokenId
class can be copied, or the following example from
org.netbeans.modules.lexer.editorbridge.calc.lang.CalcTokenId.
The
language()
method returns the language describing the token ids.
public enum CalcTokenId implements TokenId {

    WHITESPACE(null, "whitespace"),
    SL_COMMENT(null, "comment"),
    ML_COMMENT(null, "comment"),
    E("e", "keyword"),
    PI("pi", "keyword"),
    IDENTIFIER(null, null),
    INT_LITERAL(null, "number"),
    FLOAT_LITERAL(null, "number"),
    PLUS("+", "operator"),
    MINUS("-", "operator"),
    STAR("*", "operator"),
    SLASH("/", "operator"),
    LPAREN("(", "separator"),
    RPAREN(")", "separator"),
    ERROR(null, "error"),
    ML_COMMENT_INCOMPLETE(null, "comment");

    private final String fixedText;

    private final String primaryCategory;

    private CalcTokenId(String fixedText, String primaryCategory) {
        this.fixedText = fixedText;
        this.primaryCategory = primaryCategory;
    }

    public String fixedText() {
        return fixedText;
    }

    public String primaryCategory() {
        return primaryCategory;
    }

    private static final Language&lt;CalcTokenId&gt; language = new LanguageHierarchy&lt;CalcTokenId&gt;() {

        @Override
        protected Collection&lt;CalcTokenId&gt; createTokenIds() {
            return EnumSet.allOf(CalcTokenId.class);
        }

        @Override
        protected Map&lt;String,Collection&lt;CalcTokenId&gt;&gt; createTokenCategories() {
            Map&lt;String,Collection&lt;CalcTokenId&gt;&gt; cats = new HashMap&lt;String,Collection&lt;CalcTokenId&gt;&gt;();
            // Incomplete literals
            cats.put("incomplete", EnumSet.of(CalcTokenId.ML_COMMENT_INCOMPLETE));
            // Additional literals being a lexical error
            cats.put("error", EnumSet.of(CalcTokenId.ML_COMMENT_INCOMPLETE));
            return cats;
        }

        @Override
        protected Lexer&lt;CalcTokenId&gt; createLexer(LexerRestartInfo&lt;CalcTokenId&gt; info) {
            return new CalcLexer(info);
        }

        @Override
        protected String mimeType() {
            return "text/x-calc";
        }

    }.language();

    public static final Language&lt;CalcTokenId&gt; language() {
        return language;
    }

}

Note that it is not needed to publish the underlying
LanguageHierarchy
extension.
public final class CalcLexer implements Lexer&lt;CalcTokenId&gt; {

    private static final int EOF = LexerInput.EOF;

    private static final Map&lt;String,CalcTokenId&gt; keywords = new HashMap&lt;String,CalcTokenId&gt;();
    static {
        keywords.put(CalcTokenId.E.fixedText(), CalcTokenId.E);
        keywords.put(CalcTokenId.PI.fixedText(), CalcTokenId.PI);
    }

    private LexerInput input;

    private TokenFactory&lt;CalcTokenId&gt; tokenFactory;

    CalcLexer(LexerRestartInfo&lt;CalcTokenId&gt; info) {
        this.input = info.input();
        this.tokenFactory = info.tokenFactory();
        assert (info.state() == null); // passed argument always null
    }

    public Token&lt;CalcTokenId&gt; nextToken() {
        while (true) {
            int ch = input.read();
            switch (ch) {
                case '+':
                    return token(CalcTokenId.PLUS);
                case '-':
                    return token(CalcTokenId.MINUS);
                case '*':
                    return token(CalcTokenId.STAR);
                case '/':
                    switch (input.read()) {
                        case '/': // in single-line comment
                            while (true)
                                switch (input.read()) {
                                    case '\r': input.consumeNewline();
                                    case '\n':
                                    case EOF:
                                        return token(CalcTokenId.SL_COMMENT);
                                }
                        case '*': // in multi-line comment
                            while (true) {
                                ch = input.read();
                                while (ch == '*') {
                                    ch = input.read();
                                    if (ch == '/')
                                        return token(CalcTokenId.ML_COMMENT);
                                    else if (ch == EOF)
                                        return token(CalcTokenId.ML_COMMENT_INCOMPLETE);
                                }
                                if (ch == EOF)
                                    return token(CalcTokenId.ML_COMMENT_INCOMPLETE);
                            }
                    }
                    input.backup(1);
                    return token(CalcTokenId.SLASH);
                case '(':
                    return token(CalcTokenId.LPAREN);
                case ')':
                    return token(CalcTokenId.RPAREN);
                case '0': case '1': case '2': case '3': case '4':
                case '5': case '6': case '7': case '8': case '9':
                case '.':
                    return finishIntOrFloatLiteral(ch);
                case EOF:
                    return null;
                default:
                    if (Character.isWhitespace((char)ch)) {
                        ch = input.read();
                        while (ch != EOF &amp;&amp; Character.isWhitespace((char)ch)) {
                            ch = input.read();
                        }
                        input.backup(1);
                        return token(CalcTokenId.WHITESPACE);
                    }
                    if (Character.isLetter((char)ch)) { // identifier or keyword
                        while (true) {
                            if (ch == EOF || !Character.isLetter((char)ch)) {
                                input.backup(1); // backup the extra char (or EOF)
                                // Check for keywords
                                CalcTokenId id = keywords.get(input.readText());
                                if (id == null) {
                                    id = CalcTokenId.IDENTIFIER;
                                }
                                return token(id);
                            }
                            ch = input.read(); // read next char
                        }
                    }
                    return token(CalcTokenId.ERROR);
            }
        }
    }

    public Object state() {
        return null;
    }

    private Token&lt;CalcTokenId&gt; finishIntOrFloatLiteral(int ch) {
        boolean floatLiteral = false;
        boolean inExponent = false;
        while (true) {
            switch (ch) {
                case '.':
                    if (floatLiteral) {
                        return token(CalcTokenId.FLOAT_LITERAL);
                    } else {
                        floatLiteral = true;
                    }
                    break;
                case '0': case '1': case '2': case '3': case '4':
                case '5': case '6': case '7': case '8': case '9':
                    break;
                case 'e': case 'E': // exponent part
                    if (inExponent) {
                        return token(CalcTokenId.FLOAT_LITERAL);
                    } else {
                        floatLiteral = true;
                        inExponent = true;
                    }
                    break;
                default:
                    input.backup(1);
                    return token(floatLiteral ? CalcTokenId.FLOAT_LITERAL : CalcTokenId.INT_LITERAL);
            }
            ch = input.read();
        }
    }

    private Token&lt;CalcTokenId&gt; token(CalcTokenId id) {
        return (id.fixedText() != null)
                ? tokenFactory.getFlyweightToken(id, id.fixedText())
                : tokenFactory.createToken(id);
    }

}
The classes containing token ids and the language description should be part of an API. The lexer should only be part of the implementation.
The embedding may be specified statically by overriding
LanguageHierarchy.embedding();
see e.g. org.netbeans.lib.lexer.test.simple.SimpleLanguage.
Or it may be provided dynamically through the XML layer by using a file in the "Editors/language-mime-type/languagesEmbeddingMap" folder, named after the token id, containing the target mime-type and the initial and ending skip lengths:
&lt;folder name="Editors"&gt;
    &lt;folder name="text"&gt;
        &lt;folder name="x-outer-language"&gt;
            &lt;folder name="languagesEmbeddingMap"&gt;
                &lt;file name="WORD"&gt;
                    &lt;![CDATA[text/x-inner-language,1,2]]&gt;
                &lt;/file&gt;
            &lt;/folder&gt;
        &lt;/folder&gt;
    &lt;/folder&gt;
&lt;/folder&gt;

Question (arch-time):
What are the time estimates of the work?
Answer:
The present implementation is stable but there are a few missing implementations and other things to be considered:
org.netbeans.lib.lexer.test.TestRandomModify
class.
The sources for the module are in the Apache Git repositories or in the GitHub repositories.
These modules are required in project.xml:
OpenIDE-Module-Module-Dependencies: org.netbeans.modules.lexer/2 > @SPECIFICATION-VERSION@
The current API completely replaces the original one, therefore
the major version of the module was increased from 1 to 2.
There are no plans to deprecate any part of the present API
and it should be evolved in a compatible way.
java.io.File
directly?
Answer:
No.
Question (resources-layer):
Does your module provide its own layer? Does it create any files or
folders in it? What is it trying to communicate by that and with which
components?
Answer:
No.
Question (resources-read):
Does your module read any resources from layers? For what purpose?
Answer:
No.
Question (resources-mask):
Does your module mask/hide/override any resources provided by other modules in
their layers?
Answer:
No.
Question (resources-preferences):
Does your module use preferences via the Preferences API? Does your module use NbPreferences
or regular JDK Preferences? Does it read, write or both?
Does it share preferences with other modules? If so, then why?
Answer:
No.
org.openide.util.Lookup
or any similar technology to find any components to communicate with? Which ones?
Answer:
No.
Question (lookup-register):
Do you register anything into lookup for other code to find?
Answer:
No.
Question (lookup-remove):
Do you remove entries of other modules from lookup?
Answer:
No.
System.getProperty
) property?
On a similar note, is there something interesting that you
pass to java.util.logging.Logger
? Or do you observe
what others log?
Answer:
org.netbeans.lib.lexer.TokenHierarchyOperation
-
FINE
level lists lexer changes made in tokens both at the root level
and embedded levels of the token hierarchy after each document modification.
FINER
level in addition will also check the whole token hierarchy
for internal consistency after each modification.
org.netbeans.lib.lexer.TokenList
-
FINE
level forces lexer to perform more thorough and strict checks
in certain situations so this is useful mainly for tests.
Lookahead and state information is generated even for batch-lexed inputs which allows
easier checking of incremental algorithm correctness (fixing of token list after modification).
There are also some additional checks performed
that should verify correctness of the framework and the SPI implementation
classes being used (for example when flyweight tokens are created the text
passed to the token factory is compared to the text in the lexer input).
Question (exec-component):
Is execution of your code influenced by any (string) property
of any of your components?
Answer:
No.
Question (exec-ant-tasks):
Do you define or register any ant tasks that others can use?
Answer:
No.
Question (exec-classloader):
Does your code create its own class loader(s)?
Answer:
No.
Question (exec-reflection):
Does your code use Java Reflection to execute other code?
Answer:
No.
Question (exec-privateaccess):
Are you aware of any other parts of the system calling some of
your methods by reflection?
Answer:
No.
Question (exec-process):
Do you execute an external process from your module? How do you ensure
that the result is the same on different platforms? Do you parse output?
Do you depend on result code?
Answer:
No.
Question (exec-introspection):
Does your module use any kind of runtime type information (instanceof
,
work with java.lang.Class
, etc.)?
Answer:
No.
Question (exec-threading):
What threading models, if any, does your module adhere to? How does the
project behave with respect to threading?
Answer:
Use of token hierarchies for mutable input sources
must adhere to the locking mechanisms for the input sources themselves.
java.awt.datatransfer.Transferable
?
Answer:
No clipboard support.