Raw Tokens

Access tokens directly for code analysis, transformation, or custom output.

The tokenize() Function

The tokenize() function returns a list of Token objects without formatting:

from rosettes import tokenize

tokens = tokenize("x = 1 + 2", "python")

for token in tokens:
    print(f"{token.type.name:20} {token.value!r:10} L{token.line}:C{token.column}")

Output:

NAME                 'x'        L1:C1
WHITESPACE           ' '        L1:C2
OPERATOR             '='        L1:C3
WHITESPACE           ' '        L1:C4
NUMBER_INTEGER       '1'        L1:C5
WHITESPACE           ' '        L1:C6
OPERATOR             '+'        L1:C7
WHITESPACE           ' '        L1:C8
NUMBER_INTEGER       '2'        L1:C9
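
The output shows that whitespace is emitted as ordinary tokens. If a downstream step only needs meaningful tokens, they are easy to filter out (a minimal sketch using the token types shown above):

from rosettes import tokenize, TokenType

# Drop whitespace tokens before further analysis
tokens = [
    token for token in tokenize("x = 1 + 2", "python")
    if token.type != TokenType.WHITESPACE
]
print(len(tokens))  # 5: 'x', '=', '1', '+', '2'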

Token Structure

Each Token is an immutable NamedTuple:

from rosettes import Token, TokenType

token = Token(
    type=TokenType.KEYWORD,
    value="def",
    line=1,
    column=1,
)

# Access fields
token.type      # TokenType.KEYWORD
token.value     # "def"
token.line      # 1 (1-based)
token.column    # 1 (1-based)

Tokens are immutable and thread-safe.
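
Because Token is a NamedTuple, assigning to a field raises AttributeError. To derive a modified token, build a copy with the standard NamedTuple _replace() method (a small sketch):

from rosettes import Token, TokenType

token = Token(type=TokenType.NAME, value="x", line=1, column=1)

# Fields cannot be reassigned; _replace() returns a new Token instead
renamed = token._replace(value="y")
print(token.value, renamed.value)  # x y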


Use Cases

Code Metrics

Count tokens by type:

from collections import Counter
from rosettes import tokenize, TokenType

def count_token_types(code: str, language: str) -> Counter[TokenType]:
    tokens = tokenize(code, language)
    return Counter(token.type for token in tokens)

code = '''
def factorial(n):
    if n <= 1:
        return 1
    return n * factorial(n - 1)
'''

counts = count_token_types(code, "python")
print(f"Keywords: {counts[TokenType.KEYWORD]}")
print(f"Functions: {counts[TokenType.NAME_FUNCTION]}")
print(f"Numbers: {counts[TokenType.NUMBER_INTEGER]}")

Code Transformation

Strip comments from code:

from rosettes import tokenize, TokenType

def strip_comments(code: str, language: str) -> str:
    tokens = tokenize(code, language)
    comment_types = {
        TokenType.COMMENT,
        TokenType.COMMENT_SINGLE,
        TokenType.COMMENT_MULTILINE,
    }
    return "".join(
        token.value for token in tokens
        if token.type not in comment_types
    )

code = '''
x = 1  # set x
y = 2  # set y
'''

print(strip_comments(code, "python"))
# x = 1  
# y = 2  
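
As the output shows, removing comment tokens leaves the whitespace that preceded them. A variant that also trims trailing spaces (a sketch along the same lines; strip_comments_clean is a hypothetical helper name):

from rosettes import tokenize, TokenType

def strip_comments_clean(code: str, language: str) -> str:
    comment_types = {
        TokenType.COMMENT,
        TokenType.COMMENT_SINGLE,
        TokenType.COMMENT_MULTILINE,
    }
    stripped = "".join(
        token.value for token in tokenize(code, language)
        if token.type not in comment_types
    )
    # Drop the whitespace that sat in front of each removed comment
    return "\n".join(line.rstrip() for line in stripped.splitlines())

print(strip_comments_clean("x = 1  # set x", "python"))
# x = 1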

Extract Identifiers

Find all function and variable names:

from rosettes import tokenize, TokenType

def extract_names(code: str, language: str) -> dict[str, set[str]]:
    tokens = tokenize(code, language)
    
    functions = set()
    variables = set()
    
    for token in tokens:
        if token.type == TokenType.NAME_FUNCTION:
            functions.add(token.value)
        elif token.type == TokenType.NAME:
            variables.add(token.value)
    
    return {"functions": functions, "variables": variables}

code = "def greet(name): return f'Hello, {name}'"
names = extract_names(code, "python")

print(names)
# {'functions': {'greet'}, 'variables': {'name'}}

Syntax Validation

Check for unbalanced brackets:

from rosettes import tokenize, TokenType

def check_brackets(code: str, language: str) -> bool:
    tokens = tokenize(code, language)
    
    pairs = {"(": ")", "[": "]", "{": "}"}
    stack = []
    
    for token in tokens:
        if token.type == TokenType.PUNCTUATION:
            if token.value in pairs:
                stack.append(token.value)
            elif token.value in pairs.values():
                if not stack:
                    return False
                if pairs[stack.pop()] != token.value:
                    return False
    
    return len(stack) == 0

print(check_brackets("def foo(): pass", "python"))  # True
print(check_brackets("foo(()", "python"))           # False

Parallel Tokenization

For multiple code blocks, use tokenize_many():

from rosettes import tokenize_many

blocks = [
    ("def foo(): pass", "python"),
    ("const x = 1;", "javascript"),
    ("fn main() {}", "rust"),
]

results = tokenize_many(blocks)

for tokens in results:
    print(f"{len(tokens)} tokens")

On Python 3.14t (free-threaded), this provides true parallelism.
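
If you need finer control over scheduling, independent tokenize() calls can also be dispatched from a thread pool yourself (a sketch; how much real parallelism this yields depends on the interpreter build):

from concurrent.futures import ThreadPoolExecutor
from rosettes import tokenize

blocks = [
    ("def foo(): pass", "python"),
    ("const x = 1;", "javascript"),
    ("fn main() {}", "rust"),
]

# Manual dispatch of independent tokenize() calls
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda block: tokenize(*block), blocks))

for tokens in results:
    print(f"{len(tokens)} tokens")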


Direct Lexer Access

For maximum control, use the lexer directly:

from rosettes import get_lexer

lexer = get_lexer("python")

# Streaming tokenization (iterator)
for token in lexer.tokenize("x = 1"):
    print(token)

# Fast path (no position tracking)
for token_type, value in lexer.tokenize_fast("x = 1"):
    print(f"{token_type}: {value!r}")

The tokenize_fast() method returns (TokenType, str) tuples without line/column tracking, which is useful when you only need token types and values.
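
For example, counting keyword tokens without paying for position bookkeeping (a small sketch using the calls shown above; the exact count depends on the lexer's classification):

from rosettes import get_lexer, TokenType

lexer = get_lexer("python")

# Only token types and values are produced, so no line/column overhead
keyword_count = sum(
    1 for token_type, _ in lexer.tokenize_fast("def foo(): pass")
    if token_type == TokenType.KEYWORD
)
print(keyword_count)  # e.g. 2, for 'def' and 'pass'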


Next Steps