# Raw Tokens URL: /docs/extending/raw-tokens/ Section: extending Tags: tokens, analysis, customization -------------------------------------------------------------------------------- # Raw Tokens Access tokens directly for code analysis, transformation, or custom output. ## The `tokenize()` Function The `tokenize()` function returns a list of `Token` objects without formatting: ```python from rosettes import tokenize tokens = tokenize("x = 1 + 2", "python") for token in tokens: print(f"{token.type.name:20} {token.value!r:10} L{token.line}:C{token.column}") ``` Output: ```text NAME 'x' L1:C1 WHITESPACE ' ' L1:C2 OPERATOR '=' L1:C3 WHITESPACE ' ' L1:C4 NUMBER_INTEGER '1' L1:C5 WHITESPACE ' ' L1:C6 OPERATOR '+' L1:C7 WHITESPACE ' ' L1:C8 NUMBER_INTEGER '2' L1:C9 ``` --- ## Token Structure Each `Token` is an immutable `NamedTuple`: ```python from rosettes import Token, TokenType token = Token( type=TokenType.KEYWORD, value="def", line=1, column=1, ) # Access fields token.type # TokenType.KEYWORD token.value # "def" token.line # 1 (1-based) token.column # 1 (1-based) ``` Tokens are **immutable** and **thread-safe**. --- ## Use Cases ### Code Metrics Count tokens by type: ```python from collections import Counter from rosettes import tokenize, TokenType def count_token_types(code: str, language: str) -> Counter[TokenType]: tokens = tokenize(code, language) return Counter(token.type for token in tokens) code = ''' def factorial(n): if n <= 1: return 1 return n * factorial(n - 1) ''' counts = count_token_types(code, "python") print(f"Keywords: {counts[TokenType.KEYWORD]}") print(f"Functions: {counts[TokenType.NAME_FUNCTION]}") print(f"Numbers: {counts[TokenType.NUMBER_INTEGER]}") ``` ### Code Transformation Strip comments from code: ```python from rosettes import tokenize, TokenType def strip_comments(code: str, language: str) -> str: tokens = tokenize(code, language) comment_types = { TokenType.COMMENT, TokenType.COMMENT_SINGLE, TokenType.COMMENT_MULTILINE, } return "".join( token.value for token in tokens if token.type not in comment_types ) code = ''' x = 1 # set x y = 2 # set y ''' print(strip_comments(code, "python")) # x = 1 # y = 2 ``` ### Extract Identifiers Find all function and variable names: ```python from rosettes import tokenize, TokenType def extract_names(code: str, language: str) -> dict[str, set[str]]: tokens = tokenize(code, language) functions = set() variables = set() for token in tokens: if token.type == TokenType.NAME_FUNCTION: functions.add(token.value) elif token.type == TokenType.NAME: variables.add(token.value) return {"functions": functions, "variables": variables} code = "def greet(name): return f'Hello, {name}'" names = extract_names(code, "python") print(names) # {'functions': {'greet'}, 'variables': {'name'}} ``` ### Syntax Validation Check for unbalanced brackets: ```python from rosettes import tokenize, TokenType def check_brackets(code: str, language: str) -> bool: tokens = tokenize(code, language) pairs = {"(": ")", "[": "]", "{": "}"} stack = [] for token in tokens: if token.type == TokenType.PUNCTUATION: if token.value in pairs: stack.append(token.value) elif token.value in pairs.values(): if not stack: return False if pairs[stack.pop()] != token.value: return False return len(stack) == 0 print(check_brackets("def foo(): pass", "python")) # True print(check_brackets("foo(()", "python")) # False ``` --- ## Parallel Tokenization For multiple code blocks, use `tokenize_many()`: ```python from rosettes import tokenize_many blocks = [ ("def foo(): pass", "python"), ("const x = 1;", "javascript"), ("fn main() {}", "rust"), ] results = tokenize_many(blocks) for tokens in results: print(f"{len(tokens)} tokens") ``` On Python 3.14t (free-threaded), this provides true parallelism. --- ## Direct Lexer Access For maximum control, use the lexer directly: ```python from rosettes import get_lexer lexer = get_lexer("python") # Streaming tokenization (iterator) for token in lexer.tokenize("x = 1"): print(token) # Fast path (no position tracking) for token_type, value in lexer.tokenize_fast("x = 1"): print(f"{token_type}: {value!r}") ``` The `tokenize_fast()` method returns `(TokenType, str)` tuples without line/column tracking—useful when you only need token types and values. --- ## Next Steps - [[docs/reference/token-types|Token Types]] — Complete token type reference - [[docs/extending/custom-formatter|Custom Formatter]] — Build custom output formats - [[docs/reference/api|API Reference]] — Full API documentation -------------------------------------------------------------------------------- Metadata: - Word Count: 558 - Reading Time: 3 minutes