Python Lexical Structure

The lexical structure of a programming language is the set of basic rules that govern how you write programs in that language. It is the lowest-level syntax of the language, specifying such things as what variable names look like and how to denote comments. Each Python source file, like any other text file, is a sequence of characters. You can also usefully consider it a sequence of lines, tokens, or statements. These different lexical views complement each other. Python is very particular about program layout, especially regarding lines and indentation: pay attention to this information if you are coming to Python from another language.

Lines and Indentation

A Python program is a sequence of logical lines, each made up of one or more physical lines. Each physical line may end with a comment. A hash sign # that is not inside a string literal starts a comment. All characters after the #, up to but excluding the line end, are the comment: Python ignores them. A line containing only whitespace, possibly with a comment, is a blank line: Python ignores it.

In Python, the end of a physical line marks the end of most statements. Unlike in other languages, you don’t normally terminate Python statements with a delimiter, such as a semicolon (;). When a statement is too long to fit on a physical line, you can join two adjacent physical lines into a logical line by ensuring that the first physical line has no comment and ends with a backslash (\). However, Python also automatically joins adjacent physical lines into one logical line if an open parenthesis ((), bracket ([), or brace ({) has not yet been closed: take advantage of this mechanism to produce more readable code than you’d get with backslashes at line ends. Triple-quoted string literals can also span physical lines. Physical lines after the first one in a logical line are known as continuation lines. Indentation issues apply to the first physical line of each logical line, not to continuation lines.
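The three ways of spreading one logical line over several physical lines can be sketched as follows. This is a minimal illustration; the variable names are made up for the example.

```python
# Backslash continuation: the first physical line ends with a backslash
# and carries no comment of its own.
total_backslash = 1 + 2 + \
    3 + 4

# Implicit continuation: the open parenthesis joins the physical lines,
# and each physical line may end with its own comment.
total_parens = (
    1 + 2    # first pair
    + 3 + 4  # second pair
)

# A triple-quoted string literal may also span physical lines.
text = """first
second"""

print(total_backslash, total_parens, text.count("\n"))
```

Both sums are the same logical statement written two ways; the parenthesized form is usually easier to edit, since a stray space after a backslash is an error.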

Python uses indentation to express the block structure of a program. Unlike other languages, Python does not use braces, or other begin/end delimiters, around blocks of statements; indentation is the only way to denote blocks. Each logical line in a Python program is indented by the whitespace on its left. A block is a contiguous sequence of logical lines, all indented by the same amount; a logical line with less indentation ends the block. All statements in a block must have the same indentation, as must all clauses in a compound statement. The first statement in a source file must have no indentation (i.e., must not begin with any whitespace). Statements that you type at the interactive interpreter primary prompt >>> must also have no indentation.
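As a small sketch of these rules, the hypothetical function below contains nested blocks, and the final lines use the built-in compile function to show what happens when the first statement of a source text is indented (behavior as observed in v3).

```python
def classify(n):
    # The function body is one block: its statements share one indentation.
    if n < 0:
        # A nested block, indented one level further.
        label = "negative"
    else:
        label = "non-negative"
    # Less indentation than the lines above: the if/else blocks have ended.
    return label

print(classify(-5), classify(3))

# The first statement of a source text must not be indented.
first_line_error = None
try:
    compile("    x = 1\n", "<example>", "exec")
except IndentationError as exc:
    first_line_error = type(exc).__name__
print(first_line_error)
```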

v2 logically replaces each tab by up to eight spaces, so that the next character after the tab falls into logical column 9, 17, 25, and so on. Standard Python style is to use four spaces (never tabs) per indentation level.

Don’t mix spaces and tabs for indentation, since different tools (e.g., editors, email systems, printers) treat tabs differently. The -t and -tt options to the v2 Python interpreter ensure against inconsistent tab and space usage in Python source code. In v3, Python does not allow mixing tabs and spaces for indentation.
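The v3 behavior can be observed without writing a file, by compiling a source string in which one line of a block is indented with a tab and the next with eight spaces. This is a sketch assuming a v3 interpreter; the string contents are made up for the example.

```python
# One block line is indented with a tab, the next with eight spaces.
mixed_source = "if True:\n\tx = 1\n        y = 2\n"

try:
    compile(mixed_source, "<example>", "exec")
    outcome = "compiled"
except TabError:
    # TabError is a subclass of IndentationError.
    outcome = "TabError"

print(outcome)
```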

Use spaces, not tabs

We recommend you configure your favorite editor to expand tabs to four spaces, so that all Python source code you write contains just spaces, not tabs. This way, all tools, including Python itself, are consistent in handling indentation in your Python source files. Optimal Python style is to indent blocks by exactly four spaces, and use no tabs.

Character Sets

A v3 source file can use any Unicode character, encoded as UTF-8. (Characters with codes between 0 and 127, AKA ASCII characters, encode in UTF-8 into the respective single bytes, so an ASCII text file is a fine v3 Python source file, too.)

A v2 source file is usually made up of characters from the ASCII set (character codes between 0 and 127).

In both v2 and v3, you may choose to tell Python that a certain source file is written in a different encoding. In this case, Python uses that encoding to read the file (in v2, you can use non-ASCII characters only in comments and string literals).

To let Python know that a source file is written with a nonstandard encoding, start your source file with a comment of the following form:

# coding: iso-8859-1

After coding:, write the name of a codec known to Python and ASCII-compatible, such as utf-8 or iso-8859-1. Note that this coding directive comment (also known as an encoding declaration) is taken as such only if it is at the start of a source file, that is, on the first or second line (so it may follow a “shebang line”). The only effect of a coding directive in v2 is to let you use non-ASCII characters in string literals and comments. Best practice is to use utf-8 for all of your text files, including Python source files.
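One way to check which encoding Python would pick for a given source is the standard library function tokenize.detect_encoding, available in v3. The sketch below feeds it two in-memory byte strings (made up for the example) instead of real files.

```python
import io
import tokenize

# A source with a coding directive on its first line...
declared = b"# coding: iso-8859-1\nname = 'caf\xe9'\n"
# ...and one without any directive.
undeclared = b"name = 'cafe'\n"

declared_encoding, _ = tokenize.detect_encoding(io.BytesIO(declared).readline)
default_encoding, _ = tokenize.detect_encoding(io.BytesIO(undeclared).readline)

print(declared_encoding, default_encoding)
```

Without a directive, v3 falls back to utf-8, which matches the default described earlier in this section.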

Tokens

Python breaks each logical line into a sequence of elementary lexical components known as tokens. Each token corresponds to a substring of the logical line. The normal token types are identifiers, keywords, operators, delimiters, and literals, which we cover in the following sections. You may freely use whitespace between tokens to separate them. Some whitespace separation is necessary between logically adjacent identifiers or keywords; otherwise, Python would parse them as a single, longer identifier. For example, ifx is a single identifier; to write the keyword if followed by the identifier x, you need to insert some whitespace (e.g., if x).
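The ifx example can be checked with the standard library tokenize module, which reports keywords and identifiers alike as NAME tokens. The helper function name_tokens is hypothetical, written only for this sketch.

```python
import io
import tokenize

def name_tokens(source):
    """Return the text of each NAME token (identifiers and keywords)."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type == tokenize.NAME]

print(name_tokens("ifx\n"))   # one token
print(name_tokens("if x\n"))  # two tokens
```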

Identifiers

An identifier is a name used to specify a variable, function, class, module, or other object. An identifier starts with a letter (in v2, A to Z or a to z; in v3, other characters that Unicode classifies as letters are also allowed) or an underscore (_), followed by zero or more letters, underscores, and digits (in v2, 0 to 9; in v3, other characters that Unicode classifies as digits or combining marks are also allowed). See this website for a table identifying which Unicode characters can start or continue a v3 identifier. Case is significant: lowercase and uppercase letters are distinct. Punctuation characters such as @, $, and ! are not allowed in identifiers.
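In v3, the string method isidentifier applies these rules to a candidate name, which makes for a quick check. The candidate names below are made up for the example; note that isidentifier also returns True for keywords, which are covered separately.

```python
candidates = ["total", "_cache", "item2", "2items", "pay$", "na\u00efve"]

# Map each candidate to whether it is lexically a valid v3 identifier.
valid = {name: name.isidentifier() for name in candidates}
for name, ok in valid.items():
    print(name, ok)

# Case is significant: these are two distinct variables.
rate = 1
Rate = 2
print(rate, Rate)
```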

Normal Python style is to start class names with an uppercase letter, and other identifiers with a lowercase letter. Starting an identifier with a single leading underscore indicates by convention that the identifier is meant to be private. Starting an identifier with two leading underscores indicates a strongly private identifier; if the identifier also ends with two trailing underscores, however, this means that the identifier is a language-defined special name.
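The conventions look like this in practice. The class Account and its attributes are hypothetical; the comment about two leading underscores describes Python's name mangling of such attributes inside a class body, which is one concrete effect of the “strongly private” convention.

```python
class Account:                    # class name starts with an uppercase letter
    def __init__(self, owner):    # __init__ is a language-defined special name
        self.owner = owner        # ordinary identifier, lowercase
        self._balance = 0         # single underscore: private by convention only
        self.__pin = "0000"       # two underscores: stored as _Account__pin

acct = Account("ada")
print(acct.owner, acct._balance)
print(hasattr(acct, "__pin"), hasattr(acct, "_Account__pin"))
```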

Single underscore _ in the interactive interpreter

The identifier _ (a single underscore) is special in interactive interpreter sessions: the interpreter binds _ to the result of the last expression statement it has evaluated interactively, if any.

Keywords

Python has keywords (31 of them in v2; 33 in v3), which are identifiers that Python reserves for special syntactic uses. Keywords contain lowercase letters only (except, in v3, False, None, and True). You cannot use keywords as regular identifiers (thus, they’re sometimes known as “reserved words”). Some keywords begin simple statements or clauses of compound statements, while other keywords are operators. We cover all the keywords in detail in a later chapter. The keywords in v2 are:

and     continue  except   global  lambda  raise  yield
as      def       exec     if      not     return
assert  del       finally  import  or      try
break   elif      for      in      pass    while
class   else      from     is      print   with

In v3, exec and print are no longer keywords: they were statements in v2, but they’re now functions in v3. (To use the print function in v2, start your source file with from __future__ import print_function.) False, None, True, and nonlocal are new, additional keywords in v3 (of these, False, None, and True were already built-in constants in v2, but they were not technically keywords). The special tokens async and await are not currently keywords, but they’re scheduled to become keywords in Python 3.7.
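Rather than memorizing the list, you can ask the running interpreter through the standard library keyword module. The results in this sketch assume a v3 interpreter; the exact length of keyword.kwlist varies by version, so it is not shown.

```python
import keyword

# New keywords in v3.
print(keyword.iskeyword("nonlocal"), keyword.iskeyword("None"))

# No longer keywords in v3: they are now built-in functions.
print(keyword.iskeyword("print"), keyword.iskeyword("exec"))

# kwlist holds every keyword of the interpreter running this code.
print("lambda" in keyword.kwlist)
```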

Operators

Python uses nonalphanumeric characters and character combinations as operators. Python recognizes the following operators, which are covered in detail in a later chapter:

+ - * / % ** // << >> &
| ^ ~ < <= > >= <> != ==

In v3 only, you can also use @ as an operator (in matrix multiplication), although the character is technically a delimiter. Conversely, <> is accepted only in v2; in v3, write != instead.
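A few of these operators in action, as a brief sketch for v3. No built-in type implements @, so the example defines a hypothetical Vector class whose __matmul__ special method (supported since Python 3.5) computes a dot product.

```python
print(7 // 2, 7 % 2, 2 ** 10)    # floor division, remainder, power
print(1 << 4, 0b1100 & 0b1010)   # left shift, bitwise and
print(3 != 4, 3 <= 3)            # comparisons

class Vector:
    """Hypothetical class, defined only to demonstrate the @ operator."""

    def __init__(self, *components):
        self.components = components

    def __matmul__(self, other):
        # Called for: self @ other
        return sum(a * b for a, b in zip(self.components, other.components))

dot = Vector(1, 2, 3) @ Vector(4, 5, 6)
print(dot)
```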

Delimiters

Python uses the following characters and combinations as delimiters in expressions, list, dictionary, and set literals, and various statements, among other purposes:

( )   [ ]   { }
,   :   .   `   =   ;  @
+=  -=  *=  /=   //=  %=
&=  |=  ^=  >>=  <<=  **=

The period (.) can also appear in floating-point literals (e.g., 2.3) and imaginary literals (e.g., 2.3j). The last two rows are the augmented assignment operators, which are delimiters, but also perform operations. We discuss the syntax for the various delimiters when we introduce the objects or statements using them.
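The augmented assignment operators can be illustrated with a short sequence on a single variable (the name count is arbitrary). Each statement both computes a result and rebinds the name.

```python
count = 10
count += 5     # same effect as count = count + 5
count //= 4    # floor division: 15 // 4
count **= 2    # power: 3 ** 2
count <<= 1    # left shift by one bit: 9 << 1
print(count)
```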

The following characters have special meanings as part of other tokens:

' " # \

' and " surround string literals. # outside of a string starts a comment. \ at the end of a physical line joins the following physical line into one logical line; \ is also an escape character in strings. The characters $ and ?, all control characters except whitespace, and, in v2, all characters with ISO codes above 126 (i.e., non-ASCII characters, such as accented letters) can never be part of the text of a Python program, except in comments or string literals. (To use non-ASCII characters in comments or string literals in v2, you must start your Python source file with a coding directive.)

Literals

A literal is the direct denotation in a program of a data value (a number, string, or container). The following are number and string literals in Python:

42        # Integer literal
3.14      # Floating-point literal
1.0j      # Imaginary literal
'hello'   # String literal
"world"   # Another string literal
"""Good
night"""  # Triple-quoted string literal

Combining number and string literals with the appropriate delimiters, you can build literals that directly denote data values of container types:

[42, 3.14, 'hello']    # List
[]                     # Empty list
100, 200, 300          # Tuple
()                     # Empty tuple
{'x':42, 'y':3.14}     # Dictionary
{}                     # Empty dictionary
{1, 2, 4, 8, 'string'} # Set
# There is no literal to denote an empty set; use set() instead

We cover the syntax for literals in detail in a later chapter, when we discuss the various data types Python supports.

Statements

You can look at a Python source file as a sequence of simple and compound statements. Unlike some other languages, Python has no “declarations” or other top-level syntax elements: just statements.

Simple statements

A simple statement is one that contains no other statements. A simple statement lies entirely within a logical line. As in many other languages, you may place more than one simple statement on a single logical line, with a semicolon (;) as the separator. However, one statement per line is the usual and recommended Python style, and makes programs more readable.
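For comparison, here are the same three simple statements written both ways (the variable names are made up for the example):

```python
# Three simple statements on one logical line, separated by semicolons.
width = 3; height = 4; area = width * height

# The same statements in the recommended style, one per line.
width = 3
height = 4
area_recommended = width * height

print(area, area_recommended)
```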

Any expression can stand on its own as a simple statement. When working interactively, the interpreter shows the result of an expression statement you enter at the prompt (>>>) and binds the result to a global variable named _ (underscore). Apart from interactive sessions, expression statements are useful only to call functions (and other callables) that have side effects (e.g., perform output, change global variables, or raise exceptions).

An assignment is a simple statement that assigns values to variables. An assignment in Python is a statement and can never be part of an expression.

Compound statements

A compound statement contains one or more other statements and controls their execution. A compound statement has one or more clauses, aligned at the same indentation. Each clause has a header starting with a keyword and ending with a colon (:), followed by a body, which is a sequence of one or more statements. When the body contains multiple statements, also known as a block, these statements are on separate logical lines after the header line, indented four spaces rightward. The block lexically ends when the indentation returns to that of the clause header (or further left from there, to the indentation of some enclosing compound statement). Alternatively, the body can be a single simple statement, following the : on the same logical line as the header. The body may also consist of several simple statements on the same line with semicolons between them, but, as we’ve already mentioned, this is not good Python style.
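The sketch below shows both body styles in one hypothetical function, collatz_steps: the while clause has a multi-statement block, while the if and else clauses each carry a single simple statement on the header line.

```python
def collatz_steps(n):
    """Count the steps the Collatz rule takes to reach 1 from n (n >= 1)."""
    steps = 0
    while n != 1:                  # header: keyword ... colon
        # The while body is a block of several indented statements.
        if n % 2 == 0: n //= 2     # body on the same line as the header
        else: n = 3 * n + 1        # else clause, aligned with its if
        steps += 1
    # Indentation returned to the level of the while header: block ended.
    return steps

print(collatz_steps(6))
```

The one-line bodies are legal, but placing each body on its own indented line is the more common style.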
