Google

Wednesday, April 2, 2008

Elements

Syntax

Parse tree of Python code with inset tokenization
Parse tree of Python code with inset tokenization
Syntax highlighting is often used to aid programmers in the recognition of elements of source code. The language you see here is Python.
Syntax highlighting is often used to aid programmers in the recognition of elements of source code. The language you see here is Python.

A programming language's surface form is known as its syntax. Most programming languages are purely textual; they use sequences of text including words, numbers, and punctuation, much like written natural languages. On the other hand, there are some programming languages which are more graphical in nature, using spatial relationships between symbols to specify a program.

The syntax of a language describes the possible combinations of symbols that form a syntactically correct program. The meaning given to a combination of symbols is handled by semantics. Since most languages are textual, this article discusses textual syntax.

Programming language syntax is usually defined using a combination of regular expressions (for lexical structure) and Backus-Naur Form (for grammatical structure). Below is a simple grammar, based on Lisp:

expression ::= atom | list
atom ::= number | symbol
number ::= [+-]?['0'-'9']+
symbol ::= ['A'-'Z''a'-'z'].*
list ::= '(' expression* ')'

This grammar specifies the following:

  • an expression is either an atom or a list;
  • an atom is either a number or a symbol;
  • a number is an unbroken sequence of one or more decimal digits, optionally preceded by a plus or minus sign;
  • a symbol is a letter followed by zero or more of any characters (excluding whitespace); and
  • a list is a matched pair of parentheses, with zero or more expressions inside it.

The following are examples of well-formed token sequences in this grammar: '12345', '()', '(a b c232 (1))'

Not all syntactically correct programs are semantically correct. Many syntactically correct programs are nonetheless ill-formed, per the language's rules; and may (depending on the language specification and the soundness of the implementation) result in an error on translation or execution. In some cases, such programs may exhibit undefined behavior. Even when a program is well-defined within a language, it may still have a meaning that is not intended by the person who wrote it.

Using natural language as an example, it may not be possible to assign a meaning to a grammatically correct sentence or the sentence may be false:

  • "Colorless green ideas sleep furiously." is grammatically well-formed but has no generally accepted meaning.
  • "John is a married bachelor." is grammatically well-formed but expresses a meaning that cannot be true.

The following C language fragment is syntactically correct, but performs an operation that is not semantically defined (because p is a null pointer, the operations p->real and p->im have no meaning):

complex *p = NULL;
complex abs_p = sqrt (p->real * p->real + p->im * p->im);

No comments: