Data, Syntax and Semantics

Chapter 1
Introduction

Data, syntax and semantics are three fundamental ideas in Computer Science. They are Big Ideas. That is: they are ideas that are general and widely applicable; they are deep and are made precise in different ways; they lead to beautiful and useful theories. The ideas are present in any computing application. As its contributions to Science and human affairs mature, Computer Science will influence profoundly many areas of thinking, making and doing. Thus data, syntax and semantics are three ideas which are fundamental in Science and other intellectual fields.

Our introduction to these concepts of data, syntax and semantics is through the task of modelling and analysing programming languages. Programming languages abound in Computer Science and range from large general purpose languages to the small input languages of application packages. Some languages are well known and widely used, some belong to communities of specialists, some are purely experimental and do not have a user group. The number of programming languages cannot be counted!

In learning a programming language the aim is to read, write and execute programs. A programmer should understand the organisation and the operation of the program, and be able to predict and test its input-output behaviour, to know, for example, if the program computes a specified function. After gaining fluency in one or more programming languages it is both natural and necessary to reflect on the components that make up the languages and enquire about their individual roles and properties. A multitude of questions can be formulated about the capabilities and shortcomings of a language and its relation with other languages. For example, How is this programming language specified? and Is this language more powerful than that? To awaken the reader's curiosity, we will ask many questions like these shortly, in Section 1.2. To understand something of the nature and essential elements of programming languages, we will look at their overall structure and separate components rather abstractly and analytically, and answer some of these questions.

Now, this analysis, comparison and classification of language components requires a large scale scientific investigation. The investigation is organised as the study of three aspects of programs:

Data: the information to be transformed by the program
Syntax: the text of the program
Semantics: the behaviour of the program

To create a theory of programming languages we need to discover fundamental concepts, methods and results that can reveal the essential structure of programming languages, and can answer our questions.

In this first chapter we will simply prepare our minds for theoretical investigations. We will explore the scientific view of programming languages (Section 1.1) raise plenty of questions about programming languages for which we need answers (Section 1.2), and look at some raw material for modelling programming languages (Section 1.3).

We will focus on the theory of programming in the small and have only occasional encounters with programming in the large. Measured against the state of contemporary language technology, our scientific goal seems modest indeed:

To understand the general principles, structure and operation of imperative programs that compute functions on any kind of data.

Yet attaining this goal will bring rich rewards.

1.1 Science and the aims of modelling programming languages

Computing systems are artificial. They are designed and developed by and for human use. They are superseded and are discarded, sometimes with uncomfortable speed. Some are publically managed and some are commercial products. However, the scientific study of computing systems is remarkably similar to the scientific study of physical systems which are God given and timeless. Roughly speaking, scientific studies have theoretical and practical objectives, pursue answers to definite questions, and require mathematical models and experimental methods.

Theoretical Computer Science develops mathematical models and theories for the design and analysis of

data;
specifications of computational problems with data;
algorithms for transforming data;
programs and programming languages for expressing algorithms;
systems and machine architectures for implementing programming languages.

Simply put, the subject is aimed towards the discovery and application of fundamental principles, models, methods and results, which are intended

to help understand the nature, scope and limits of computing systems;
to help software and hardware engineers make computing systems.

Fundamentally, Theoretical Computer Science is speculative. It thrives on curiosity and the freedom to imagine, idealise and simplify. Theoretical Computer Science creates ideas for new data types and applications, new ways of specifying, programming and reasoning, and new designs for machines and devices. For example, over the past decades, theorists have been thinking about new ways of doing exact calculations with real numbers, the integration of simulated and experimental data in virtual worlds, languages for specifying and programming distributed and mobile devices, ways of modelling and proving concurrent systems correct, asynchronous designs for chips, and quantum and biological computers. In time, some theories become the foundations of mature technologies, whilst others stubbornly resist practical or commercial exploitation. Theoretical Computer Science also has many problems that retain importance long after the technology is established. The theory of programming languages is laden with such legacy problems: the theories of types, concurrency, objects, agents are rich in difficulties and mysteries, both conceptual and mathematical.

Theories of programming languages and program construction are a fundamental area of Theoretical Computer Science. There are many programming constructs and program development techniques and tools, all of which are the fruits of, or require, theoretical investigation. In our time, it is believed widely that the development of theories is necessary for the practical craft of program construction to become a mathematically well-founded engineering science. Independently of such a goal, we believe that the development of theories is necessary to satisfy our curiosity and to understand what is being done, what could be done, and how to do better.

The scientific approach of the theory of programming languages places three intellectual requirements on the reader:

Intellectual Aims
1. To ask simple questions.
2. To make and analyse simple mathematical models.
3. To think deeply and express ideas precisely.

Now we will formulate a number of questions and set the scene for some mathematical theories to answer them. Our theories will show how it is possible

to model mathematically any kind of data;
to model mathematically the syntax of a programming language;
to model mathematically the semantics of a programming language.

1.2 Some Scientific Questions about Programming Languages

Let us begin by posing some simple questions about a programming language. The questions will make us reflect on our ideas about programming languages and programs. They require us

to think generally and abstractly.

They also require us

to explore our existing knowledge, experience and prejudices.

This is part of the raw material from which theories are made. Most of the questions we pose are insufficiently precise to allow a definite answer. Thus, one of our most important tasks will be

to make questions precise, using mathematical models, and hence turn them into technical problems that can be solved definitively.

By no means all of the questions will be given an answer in this book: readers are invited to return to this section from time to time to see which questions they can formulate and answer precisely.

Let $$L$$ and $$L^\prime$$ be any programming languages, real or imaginary. Try to answer the following seemingly vague questions.

Data

What is data and where does it come from? How do we choose operations on data? How do we compare the choice of different sets of operations? How do we know we have enough operations on data? What exactly are data types? What is an interface to a data type? What is an implementation of a data type? How do we specify data types like integers, reals or strings, independently of programming languages? How do we model data types for users' applications? To what extent is a data type independent of its implementation? How do we compare two implementations of a data type? Can any data type be implemented using an imperative language? Are the representations of the natural numbers based upon decimal, binary, octal and Roman notations equivalent? How accurate as approximations are implementations of the real numbers? What are the effects of approximating infinite sets of data by finite subsets? Are error messages necessary for data types? What are the base data types of $$L$$? What data types can be implemented in $$L$$? Can any data type be implemented in $$L$$?

Syntax

What is syntax and why is there so much of it? Is any notation a syntax? How do we choose and make a syntax using symbols and rules? How do we check for errors in syntax? How do we transform program syntaxes, as in substitution and expansion, compilation and interpretation? How do we specify and transform texts for pretty printing, internet publication and slide projection? What are the syntactic categories such identifiers, types, operations, declarations, commands, and procedures, classes? What are scope and binding rules? What are the benefits of prefix, infix, postfix, or mixfix notations? How do we define exactly the syntax of $$L$$? How do we specify the set of legal programs of $$L$$? Is there an algorithm that checks that the syntax of a program of $$L$$ is correct?

Semantics

What is semantics? Why do we need to define behaviour formally? How do we choose one from the many possible semantics for a construct? Is there a "right" semantics for every construct? What is input-output behaviour? What are deterministic and nondeterministic behaviours? Can every partial function be extended to a total function by adding an undefined flag? How do we model the behaviour of any program containing while, goto, and recursion? Can one use tests that return a "don't know" flag? What happens when a program raises an exception? How do procedures work? What is encapsulation and information hiding? What is parallel execution and concurrent communication? What are type systems, classes and objects, inheritance and polymorphism? What is a program library? How do we define exactly the semantics of $$L$$? How do we specify the meaning of the data types and commands of $$L$$? How do we specify the operation or dynamic behaviour of programs of $$L$$? How do we specify the input-output behaviour of programs required by users? What is a program trace? What is the relationship between the number of steps in a computation and its run time? What exactly does it mean for two programs of $$L$$ to be equivalent?

Expressiveness and Power

How expressive or powerful is $$L$$? Can $$L$$ implement any desired data type, function or specification? Which specifications cannot be implemented in $$L$$? Is $$L$$ equally expressive as another programming language $$L^\prime$$? There are four possibilities:

$$L$$ and $$L^\prime$$ are equivalent;
$$L$$ can accomplish all that $$L^\prime$$ can and more;
$$L^\prime$$ can accomplish all that $$L$$ can and more; or
$$L$$ can accomplish some tasks that $$L^\prime$$ cannot, and $$L^\prime$$ can accomplish some tasks that $$L$$ cannot.

What exactly does it mean for two languages $$L$$ and $$L^\prime$$ to be equivalent in expressiveness or power? Which are the most expressive, imperative, object-oriented, functional or logic programming languages? Are parallel languages more expressive than sequential languages? Is $$L$$ a universal language, i.e., can it implement any specification that can be implemented in some other programming language? Do universal languages exist? What is the smallest set of imperative constructs necessary to make a universal language?

Program Properties

What properties of the programs of $$L$$ can be specified? Are there properties that cannot be specified? What exactly are static and dynamic properties? What properties of the programs of $$L$$ can be checked by algorithms? What properties are decidable, or undecidable, when the program is being compiled? What properties are decidable, or undecidable, when the program is being executed? What is the relationship between the expressiveness or power of the language $$L$$ and its decidable properties? What properties of programs in $$L$$ can be proved? Given a program and an identifier, can we decide whether or not

the identifier appears in the program?
given an input, the identifier changes its value in the resulting computation?
the identifier changes its value in some computation?

Given a program and an input, can we decide whether or not the program halts when executed on the input? Given two programs, can we decide whether or not they are equivalent?

Correctness

How do we specify the purpose of a program in $$L$$? To what extent can we test whether or not a program of $$L$$ meets its specification? How do we prove the correctness of a program with respect to its specification? Given a specification method, are there programs that are correct with respect to a specification but which cannot be proved to be correct? Is there an algorithm which, given a specification and a program, decides whether or not the program meets its specification?

Compilation

What exactly is compilation? How do we structure a compiler from the syntax of $$L$$ to the syntax of $$L^\prime$$? What does it mean for a compiler from $$L$$ to $$L^\prime$$ to be correct? Is correctness defined using a class of specifications, or by comparing the semantics of $$L$$ and $$L^\prime$$? How do we prove that a compiler from $$L$$ to $$L^\prime$$ is correct?

Efficiency

Are the programs of $$L$$ efficient for a class of specifications? If $$L$$ and $$L^\prime$$ are equally expressive will the programs written in them execute equally efficiently? Can $$L$$ be compiled efficiently on a von Neumann machine and is the code generated efficient? Are imperative programming languages more efficient than logic and functional languages? Are programs only involving while and other so-called structured constructs, less efficient than those allowing gotos? Are parallel languages more efficient than sequential languages?

Attempting to answer the questions should reveal what the reader knows about programming languages, before and after studying this book.