java.lang.Object
  org.apache.lucene.analysis.Token
public class Token
A Token is an occurrence of a term from the text of a field. It consists of a term's text, the start and end offset of the term in the text of the field, and a type string.
The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser, or to show matching text fragments in a KWIC (KeyWord In Context) display, etc.
The type is a string, assigned by a lexical analyzer (a.k.a. tokenizer), naming the lexical or syntactic class that the token belongs to. For example an end of sentence marker token might be implemented with type "eos". The default token type is "word".
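As an illustration of re-associating a token with its source text, the sketch below wraps the span covered by a token's offsets in brackets. It is plain Java with no Lucene dependency; `OffsetHighlightSketch` and `highlight` are hypothetical names, and the only assumption taken from this page is that the end offset is one greater than the index of the token's last character.

```java
// Hypothetical helper, not part of Lucene: marks the span [startOffset, endOffset)
// of the source text, as a highlighter or KWIC display might.
final class OffsetHighlightSketch {

    static String highlight(String source, int startOffset, int endOffset) {
        // endOffset is exclusive, so substring(startOffset, endOffset) is exactly the token text.
        return source.substring(0, startOffset)
                + "[" + source.substring(startOffset, endOffset) + "]"
                + source.substring(endOffset);
    }
}
```

For the text "the quick fox", a token for "quick" would carry start offset 4 and end offset 9.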
A Token can optionally have metadata (a.k.a. Payload) in the form of a variable length byte array. Use TermPositions.getPayloadLength() and TermPositions.getPayload(byte[], int) to retrieve the payloads from the index.
WARNING: The status of the Payloads feature is experimental. The APIs introduced here might change in the future and will not be supported anymore in such a case.
NOTE: As of 2.3, Token stores the term text internally as a malleable char[] termBuffer instead of String termText. The indexing code and core tokenizers have been changed to re-use a single Token instance, changing its buffer and other fields in-place as the Token is processed. This provides substantially better indexing performance as it saves the GC cost of new'ing a Token and String for every term. The APIs that accept String termText are still available but a warning about the associated performance cost has been added (below). The termText() method has been deprecated.
Tokenizers and filters should try to re-use a Token instance when possible for best performance, by implementing the TokenStream.next(Token) API.
Failing that, to create a new Token you should first use one of the constructors that starts with null text. To load the token from a char[], use setTermBuffer(char[], int, int). To load from a String, use setTermBuffer(String) or setTermBuffer(String, int, int).
Alternatively, you can get the Token's termBuffer by calling either termBuffer(), if you know that your text is shorter than the capacity of the termBuffer, or resizeTermBuffer(int), if there is any possibility that you may need to grow the buffer. Fill in the characters of your term into this buffer, with String.getChars(int, int, char[], int) if loading from a string, or with System.arraycopy(Object, int, Object, int, int), and finally call setTermLength(int) to set the length of the term text. See LUCENE-969 for details.
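The fill-then-set-length sequence described above can be sketched as follows. `TermBufferSketch` is a hypothetical stand-in that mimics only the buffer-related methods named on this page (termBuffer(), resizeTermBuffer(int), setTermLength(int), termLength(), term()); it is not the Lucene class, and its initial capacity of 10 and the two `load` helpers are assumptions made for the example.

```java
import java.util.Arrays;

// Hypothetical stand-in for the term-buffer part of Token; not Lucene code.
final class TermBufferSketch {
    private char[] termBuffer = new char[10]; // initial capacity is an assumption
    private int termLength = 0;

    char[] termBuffer() { return termBuffer; }

    int termLength() { return termLength; }

    // Grows the buffer to at least newSize, preserving the existing content.
    char[] resizeTermBuffer(int newSize) {
        if (termBuffer.length < newSize) {
            termBuffer = Arrays.copyOf(termBuffer, newSize);
        }
        return termBuffer;
    }

    // Records how many characters of the buffer are valid term text.
    void setTermLength(int length) {
        if (length < 0 || length > termBuffer.length) {
            throw new IllegalArgumentException("length " + length + " exceeds buffer capacity");
        }
        termLength = length;
    }

    String term() { return new String(termBuffer, 0, termLength); }

    // Loading from a String: grow if needed, copy with getChars, then set the length.
    void load(String text) {
        char[] buffer = resizeTermBuffer(text.length());
        text.getChars(0, text.length(), buffer, 0);
        setTermLength(text.length());
    }

    // Loading from a char[] slice: same steps, copying with System.arraycopy.
    void load(char[] source, int offset, int length) {
        char[] buffer = resizeTermBuffer(length);
        System.arraycopy(source, offset, buffer, 0, length);
        setTermLength(length);
    }
}
```

Calling resizeTermBuffer(int) before writing is the safe choice whenever the text might exceed the current capacity; forgetting setTermLength(int) afterwards would leave the previous length in effect.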
Typical reuse patterns:
return reusableToken.reinit(string, startOffset, endOffset[, type]);
return reusableToken.reinit(string, 0, string.length(), startOffset, endOffset[, type]);
return reusableToken.reinit(buffer, 0, buffer.length, startOffset, endOffset[, type]);
return reusableToken.reinit(buffer, start, end - start, startOffset, endOffset[, type]);
return reusableToken.reinit(source.termBuffer(), 0, source.termLength(), source.startOffset(), source.endOffset()[, source.type()]);
Because TokenStreams can be chained, one cannot assume that the Token's current type is correct.
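To make the reuse idea concrete, here is a minimal, self-contained sketch of a whitespace splitter that refills one token instance for every term instead of allocating a new one. `ReusableTokenSketch` and `WhitespaceSplitSketch` are hypothetical stand-ins written for this example; they only imitate the shape of reinit(String, int, int, int, int) and next(Token) and are not Lucene classes. Note that the sketch's reinit always resets the type to the default, which reflects the point above about not trusting a chained token's current type.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a reusable token; not Lucene code.
final class ReusableTokenSketch {
    static final String DEFAULT_TYPE = "word";
    char[] termBuffer = new char[16];
    int termLength;
    int startOffset;
    int endOffset;
    String type = DEFAULT_TYPE;

    // Refills this instance in place and returns it, so callers can write
    // "return reusableToken.reinit(...)".
    ReusableTokenSketch reinit(String text, int termOffset, int length, int newStart, int newEnd) {
        if (termBuffer.length < length) {
            termBuffer = new char[length]; // old content is about to be overwritten anyway
        }
        text.getChars(termOffset, termOffset + length, termBuffer, 0);
        termLength = length;
        startOffset = newStart;
        endOffset = newEnd;
        type = DEFAULT_TYPE; // never keep a type left over from an earlier use
        return this;
    }

    String term() { return new String(termBuffer, 0, termLength); }
}

// Hypothetical splitter that hands back the same token instance on every call.
final class WhitespaceSplitSketch {
    private final String text;
    private int pos = 0;

    WhitespaceSplitSketch(String text) { this.text = text; }

    // Returns the refilled reusableToken, or null when the input is exhausted.
    ReusableTokenSketch next(ReusableTokenSketch reusableToken) {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        if (pos == text.length()) {
            return null;
        }
        int start = pos;
        while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        return reusableToken.reinit(text, start, pos - start, start, pos);
    }

    // Collects all terms using a single token instance for the whole input.
    static List<String> terms(String text) {
        WhitespaceSplitSketch splitter = new WhitespaceSplitSketch(text);
        ReusableTokenSketch reusableToken = new ReusableTokenSketch();
        List<String> out = new ArrayList<>();
        while (splitter.next(reusableToken) != null) {
            out.add(reusableToken.term());
        }
        return out;
    }
}
```

Because the same instance is overwritten on each call, a consumer that needs to keep a term beyond the next call has to copy it out first, as `terms` does with term().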
See Also: Payload
Field Summary | |
---|---|
static String | DEFAULT_TYPE |
Constructor Summary | |
---|---|
Token() | Constructs a Token with null text. |
Token(char[] startTermBuffer, int termBufferOffset, int termBufferLength, int start, int end) | Constructs a Token with the given term buffer (offset and length) and start and end offsets. |
Token(int start, int end) | Constructs a Token with null text and start and end offsets. |
Token(int start, int end, int flags) | Constructs a Token with null text and start and end offsets plus flags. |
Token(int start, int end, String typ) | Constructs a Token with null text and start and end offsets plus the Token type. |
Token(String text, int start, int end) | Deprecated. |
Token(String text, int start, int end, int flags) | Deprecated. |
Token(String text, int start, int end, String typ) | Deprecated. |
Method Summary | | |
---|---|---|
void | clear() | Resets the term text, payload, flags, and positionIncrement to default. |
Object | clone() | |
Token | clone(char[] newTermBuffer, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset) | Makes a clone, but replaces the term buffer and start/end offsets in the process. |
int | endOffset() | Returns this Token's ending offset, one greater than the position of the last character corresponding to this token in the source text. |
boolean | equals(Object obj) | |
int | getFlags() | EXPERIMENTAL: While we think this is here to stay, we may want to change it to be a long. |
Payload | getPayload() | Returns this Token's payload. |
int | getPositionIncrement() | Returns the position increment of this Token. |
int | hashCode() | |
Token | reinit(char[] newTermBuffer, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset) | Shorthand for calling clear(), setTermBuffer(char[], int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String) with Token.DEFAULT_TYPE. |
Token | reinit(char[] newTermBuffer, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset, String newType) | Shorthand for calling clear(), setTermBuffer(char[], int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String). |
Token | reinit(String newTerm, int newStartOffset, int newEndOffset) | Shorthand for calling clear(), setTermBuffer(String), setStartOffset(int), setEndOffset(int), and setType(java.lang.String) with Token.DEFAULT_TYPE. |
Token | reinit(String newTerm, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset) | Shorthand for calling clear(), setTermBuffer(String, int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String) with Token.DEFAULT_TYPE. |
Token | reinit(String newTerm, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset, String newType) | Shorthand for calling clear(), setTermBuffer(String, int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String). |
Token | reinit(String newTerm, int newStartOffset, int newEndOffset, String newType) | Shorthand for calling clear(), setTermBuffer(String), setStartOffset(int), setEndOffset(int), and setType(java.lang.String). |
void | reinit(Token prototype) | Copy the prototype token's fields into this one. |
void | reinit(Token prototype, char[] newTermBuffer, int offset, int length) | Copy the prototype token's fields into this one, with a different term. |
void | reinit(Token prototype, String newTerm) | Copy the prototype token's fields into this one, with a different term. |
char[] | resizeTermBuffer(int newSize) | Grows the termBuffer to at least size newSize, preserving the existing content. |
void | setEndOffset(int offset) | Set the ending offset. |
void | setFlags(int flags) | |
void | setPayload(Payload payload) | Sets this Token's payload. |
void | setPositionIncrement(int positionIncrement) | Set the position increment. |
void | setStartOffset(int offset) | Set the starting offset. |
void | setTermBuffer(char[] buffer, int offset, int length) | Copies the contents of buffer, starting at offset for length characters, into the termBuffer array. |
void | setTermBuffer(String buffer) | Copies the contents of buffer into the termBuffer array. |
void | setTermBuffer(String buffer, int offset, int length) | Copies the contents of buffer, starting at offset and continuing for length characters, into the termBuffer array. |
void | setTermLength(int length) | Set number of valid characters (length of the term) in the termBuffer array. |
void | setTermText(String text) | Deprecated. Use setTermBuffer(char[], int, int), setTermBuffer(String), or setTermBuffer(String, int, int). |
void | setType(String type) | Set the lexical type. |
int | startOffset() | Returns this Token's starting offset, the position of the first character corresponding to this token in the source text. |
String | term() | Returns the Token's term text. |
char[] | termBuffer() | Returns the internal termBuffer character array which you can then directly alter. |
int | termLength() | Return number of valid characters (length of the term) in the termBuffer array. |
String | termText() | Deprecated. This method now has a performance penalty because the text is stored internally in a char[]. If possible, use termBuffer() and termLength() directly instead. If you really need a String, use term(). |
String | toString() | |
String | type() | Returns this Token's lexical type. |
Methods inherited from class java.lang.Object |
---|
finalize, getClass, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
public static final String DEFAULT_TYPE
Constructor Detail |
---|
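public Token()
Constructs a Token with null text.
public Token(int start, int end)
Constructs a Token with null text and start and end offsets.
Parameters:
start - start offset in the source text
end - end offset in the source text
public Token(int start, int end, String typ)
Constructs a Token with null text and start and end offsets plus the Token type.
Parameters:
start - start offset in the source text
end - end offset in the source text
typ - the lexical type of this Token
public Token(int start, int end, int flags)
Constructs a Token with null text and start and end offsets plus flags.
Parameters:
start - start offset in the source text
end - end offset in the source text
flags - the bits to set for this token
public Token(String text, int start, int end)
Deprecated.
Parameters:
text - term text
start - start offset
end - end offset
public Token(String text, int start, int end, String typ)
Deprecated.
Parameters:
text - term text
start - start offset
end - end offset
typ - token type
public Token(String text, int start, int end, int flags)
Deprecated.
Parameters:
text -
start -
end -
flags - token type bits
public Token(char[] startTermBuffer, int termBufferOffset, int termBufferLength, int start, int end)
Constructs a Token with the given term buffer (offset and length) and start and end offsets.
Parameters:
startTermBuffer -
termBufferOffset -
termBufferLength -
start -
end -
Method Detail |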
---|
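public void setPositionIncrement(int positionIncrement)
Set the position increment. This determines the position of this token relative to the previous Token in a TokenStream, used in phrase searching.
The default value is one.
Some common uses for this are setting it to zero, to put multiple terms in the same position (for example, when a word has multiple stems), and setting it to values greater than one, to inhibit exact phrase matches (for example, across removed stop words).
Parameters:
positionIncrement - the distance from the prior term
See Also: TermPositions
public int getPositionIncrement()
Returns the position increment of this Token.
See Also: setPositionIncrement(int)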
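The arithmetic behind position increments can be sketched in a few lines of plain Java. `PositionIncrementSketch` is a hypothetical helper written for this example, and the convention that the running position starts at -1 (so that a first increment of one lands on position 0) is an assumption of the sketch, not something stated on this page.

```java
// Hypothetical helper, not Lucene code: turns per-token position increments
// into absolute positions by accumulating them.
final class PositionIncrementSketch {

    static int[] positions(int[] increments) {
        int[] positions = new int[increments.length];
        int position = -1; // assumption: the first token with increment 1 gets position 0
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            positions[i] = position;
        }
        return positions;
    }
}
```

With increments {1, 0, 2}, the second token shares the first token's position (increment zero, as for an extra stem or synonym), and the third token leaves a gap of one position (increment two, as after one removed stop word), which is what prevents an exact phrase match across the gap.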
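public void setTermText(String text)
Deprecated. Use setTermBuffer(char[], int, int), setTermBuffer(String), or setTermBuffer(String, int, int).
public final String termText()
Deprecated. This method now has a performance penalty because the text is stored internally in a char[]. If possible, use termBuffer() and termLength() directly instead. If you really need a String, use term().
public final String term()
Returns the Token's term text. This method has a performance penalty because the text is stored internally in a char[]. If possible, use termBuffer() and termLength() directly instead. If you really need a String, use this method, which is nothing more than a convenience call to new String(token.termBuffer(), 0, token.termLength()).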
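public final void setTermBuffer(char[] buffer, int offset, int length)
Copies the contents of buffer, starting at offset for length characters, into the termBuffer array.
Parameters:
buffer - the buffer to copy
offset - the index in the buffer of the first character to copy
length - the number of characters to copy
public final void setTermBuffer(String buffer)
Copies the contents of buffer into the termBuffer array.
Parameters:
buffer - the buffer to copy
public final void setTermBuffer(String buffer, int offset, int length)
Copies the contents of buffer, starting at offset and continuing for length characters, into the termBuffer array.
Parameters:
buffer - the buffer to copy
offset - the index in the buffer of the first character to copy
length - the number of characters to copy
public final char[] termBuffer()
Returns the internal termBuffer character array which you can then directly alter. If the array is too small for your token, use resizeTermBuffer(int) to increase it. After altering the buffer, be sure to call setTermLength(int) to record the number of valid characters that were placed into the termBuffer.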
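public char[] resizeTermBuffer(int newSize)
Grows the termBuffer to at least size newSize, preserving the existing content. Note: if the next operation is to change the contents of the term buffer, use setTermBuffer(char[], int, int), setTermBuffer(String), or setTermBuffer(String, int, int) to optimally combine the resize with the setting of the termBuffer.
Parameters:
newSize - minimum size of the new termBuffer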
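public final int termLength()
Return number of valid characters (length of the term) in the termBuffer array.
public final void setTermLength(int length)
Set number of valid characters (length of the term) in the termBuffer array. Note: to grow the size of the array, use resizeTermBuffer(int) first.
Parameters:
length - the truncated length
public final int startOffset()
Returns this Token's starting offset, the position of the first character corresponding to this token in the source text.
public void setStartOffset(int offset)
Set the starting offset.
See Also: startOffset()
public final int endOffset()
Returns this Token's ending offset, one greater than the position of the last character corresponding to this token in the source text.
public void setEndOffset(int offset)
Set the ending offset.
See Also: endOffset()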
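public final String type()
Returns this Token's lexical type.
public final void setType(String type)
Set the lexical type.
See Also: type()
public int getFlags()
EXPERIMENTAL: While we think this is here to stay, we may want to change it to be a long. Get the bitset for any bits that have been set. This is completely distinct from type(), although they do share similar purposes. The flags can be used to encode information about the token for use by other TokenFilters.
public void setFlags(int flags)
See Also: getFlags()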
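One way to picture how an int of flags could carry information between filters is ordinary bit masking, sketched below. Everything here is hypothetical: `TokenFlagsSketch`, the two flag constants, and their meanings are invented for the example, and this page does not define any particular flag bits.

```java
// Hypothetical helper, not Lucene code: packs independent yes/no facts
// about a token into the bits of a single int.
final class TokenFlagsSketch {
    static final int FLAG_ACRONYM = 1;          // bit 0, invented for this example
    static final int FLAG_CAPITALIZED = 1 << 1; // bit 1, invented for this example

    // Returns flags with the given bit switched on; other bits are untouched.
    static int set(int flags, int bit) {
        return flags | bit;
    }

    // Tests whether the given bit is switched on.
    static boolean isSet(int flags, int bit) {
        return (flags & bit) != 0;
    }
}
```

An earlier filter could store the result of set(...) with setFlags(int), and a later filter could read it back with getFlags() and test individual bits, while the type string stays free for the lexical class.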
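public Payload getPayload()
Returns this Token's payload.
public void setPayload(Payload payload)
Sets this Token's payload.
public String toString()
Overrides: toString in class Object
public void clear()
Resets the term text, payload, flags, and positionIncrement to default.
public Object clone()
Overrides: clone in class Object
public Token clone(char[] newTermBuffer, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset)
Makes a clone, but replaces the term buffer and start/end offsets in the process.
public boolean equals(Object obj)
Overrides: equals in class Object
public int hashCode()
Overrides: hashCode in class Object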
public Token reinit(char[] newTermBuffer, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset, String newType)
Shorthand for calling clear(), setTermBuffer(char[], int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String).
public Token reinit(char[] newTermBuffer, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset)
Shorthand for calling clear(), setTermBuffer(char[], int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String) with Token.DEFAULT_TYPE.
public Token reinit(String newTerm, int newStartOffset, int newEndOffset, String newType)
Shorthand for calling clear(), setTermBuffer(String), setStartOffset(int), setEndOffset(int), and setType(java.lang.String).
public Token reinit(String newTerm, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset, String newType)
Shorthand for calling clear(), setTermBuffer(String, int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String).
public Token reinit(String newTerm, int newStartOffset, int newEndOffset)
Shorthand for calling clear(), setTermBuffer(String), setStartOffset(int), setEndOffset(int), and setType(java.lang.String) with Token.DEFAULT_TYPE.
public Token reinit(String newTerm, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset)
Shorthand for calling clear(), setTermBuffer(String, int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String) with Token.DEFAULT_TYPE.
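public void reinit(Token prototype)
Copy the prototype token's fields into this one.
Parameters:
prototype -
public void reinit(Token prototype, String newTerm)
Copy the prototype token's fields into this one, with a different term.
Parameters:
prototype -
newTerm -
public void reinit(Token prototype, char[] newTermBuffer, int offset, int length)
Copy the prototype token's fields into this one, with a different term.
Parameters:
prototype -
newTermBuffer -
offset -
length -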