Distributed & Parallel Processing

Presenter: Professor George Wells

July – August 2011

Description

This course is aimed at introducing the student to the general area of computer science known as distributed and parallel processing. A broad overview of the subject is taken, describing a variety of distributed, parallel, and client-server options, from formal specifications to practical implementations. The content covered is wide-ranging, and relatively shallow, and is intended to build upon a general undergraduate Computer Science knowledge. A special emphasis is placed on affordable asynchronous processing, currently the most prevalent model, and on models applicable to multicore processors, which are of ever-increasing importance.

This course is based closely on versions given in previous years by Dr. Peter Clayton and Dr. Karen Bradshaw.

Introduction

In the past two to three decades, the field of concurrent programming has been a prolific area of research, spurred on by the realization that concurrent algorithms frequently provide a more naturally expressed solution to many problems. Concurrent programming principles are no longer solely the domain of the implementors of operating systems, but are being applied to an ever-increasing range of applications. Several authors affirm that when a concurrent solution is formulated in a sequential programming language, the mapping of the solution onto the sequential program is unnatural, and therefore error-prone, difficult to maintain, and unreliable. Numerous programming languages have been developed for the expression of concurrent algorithms, and many distributed environments exist in which general programming notations can be used to express concurrent behaviour. In addition to the software engineering benefits that modern concurrent programming tools afford applications in which concurrent behaviour is a fundamental aspect of the problem area, they provide for the direct representation of processes which will execute in parallel on multiprocessor hardware. This enables the software designer to obtain increased performance, in terms of speed or reliability, or both.

Coupled to the concept of concurrent processing (many simultaneous activities) is the opportunity of distributed processing (many simultaneous places). This quickly extends from multicore processors and multiprocessor architectures, to local area networks, and to wide area networks. The basic requirements for distributing applications are simple: to distribute application processes across multiple machines/processors/cores, to locate those processes, and to communicate among them. Of course, other services are required, such as: maintaining security of the data; accommodating different networks, operating systems, and data formats; providing transactional support; and accessing often incompatible data sources. To address these tasks on general networks, a proliferation of application development products identify themselves as middleware.
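The three basic requirements above can be sketched in a few lines of Python, with a server thread standing in for a remote machine: the server process is started separately ("distribute"), found by its host name and port number ("locate"), and exchanged with over a TCP socket ("communicate"). The host and port values here are illustrative assumptions, not part of the course material.

```python
# Distribute, locate, and communicate: a minimal single-machine sketch.
import socket
import threading

HOST, PORT = "127.0.0.1", 0  # port 0: let the OS pick a free port

def serve(listener: socket.socket) -> None:
    """Accept one connection and echo the request back, upper-cased."""
    conn, _addr = listener.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())

# "Distribute": run the server as a separate thread of control.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind((HOST, PORT))
listener.listen(1)
port = listener.getsockname()[1]  # "Locate": discover the assigned port
threading.Thread(target=serve, args=(listener,), daemon=True).start()

# "Communicate": the client sends a message and reads the reply.
with socket.create_connection((HOST, port)) as client:
    client.sendall(b"hello")
    reply = client.recv(1024)
print(reply.decode())
```

In a genuinely distributed setting the server would run on another machine, and the "locate" step would involve a naming or directory service rather than a shared variable.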

Because the degree of complexity rises rapidly in systems that are spread about a network, with a great deal of simultaneous activity, there is a strong need for formal software engineering approaches to the design and implementation of distributed and parallel systems.

Some Terminology

Concurrency and Parallelism: Two entities are said to be executing in parallel if at some instant in time both are actually executing. Entities are described as concurrent if they have the potential for executing in parallel. Therefore, programming languages or run-time environments are described as concurrent, rather than parallel. A concurrent programming language will have more than one thread of control, enabling code segments which could execute in parallel to be directly represented.
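The distinction can be illustrated with a short threaded program (a sketch, not part of the original notes): the two threads below are concurrent by construction, but whether they ever execute in parallel depends on how many processors are available at run time.

```python
# Two concurrent threads: they *may* run in parallel on a multiprocessor,
# but the program is correct either way -- concurrency is a property of
# the program, parallelism a property of a particular execution.
import threading

results = {}

def worker(name: str, n: int) -> None:
    """Each thread is a separate thread of control."""
    results[name] = sum(range(n))

t1 = threading.Thread(target=worker, args=("a", 100))
t2 = threading.Thread(target=worker, args=("b", 200))
t1.start(); t2.start()   # both threads are now executable
t1.join(); t2.join()     # wait for both to finish

print(results["a"], results["b"])  # 4950 19900
```

The same source program runs unchanged on one core or many; only the degree of actual parallelism differs.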

Processes, processors, and tasks: We define a task as an operation or a set of operations to be performed, and a process as an instance of the task (these terms are frequently used interchangeably in the literature). The term processing means the execution of an instance of the task. Processors are the agents that carry out processes. In a computer system this is often taken to be an ALU (Arithmetic Logic Unit) or CPU (Central Processing Unit). However, in most modern systems the ability of a single CPU to time-slice between several processes gives rise to the concept of a virtual machine or abstract processor.

After coming into existence, a process's life-history can be defined in terms of three primary states: executing, executable, and suspended. A process is suspended if it is delayed (most synchronization primitives can lead to delay); this state is also known as blocked. If it is not suspended, then a process is either executing, if there is a processor available, or executable (able to execute, but prohibited from doing so by the lack of a processor). Modern languages usually require their compiler to generate a run-time system that manages the queues of suspended and executable processes, and schedules the executable processes when there are not enough processors for all of them. When moving from running one process to running another, the run-time system must also cater for state changes, which it does using a procedure known as a context switch. This involves storing the volatile environment of the current process and restoring the corresponding environment of the process that is due to run.
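The life-history above can be sketched as a small state machine. The event names below (dispatch, preempt, block, wake) are common textbook conventions assumed here for illustration; they are not tied to any particular run-time system.

```python
# A minimal sketch of the three-state process model.
from enum import Enum

class State(Enum):
    EXECUTABLE = "executable"  # able to run, but no processor available
    EXECUTING = "executing"    # currently running on a processor
    SUSPENDED = "suspended"    # blocked, e.g. on a synchronization primitive

# Legal transitions of the model: (current state, event) -> next state
TRANSITIONS = {
    (State.EXECUTABLE, "dispatch"): State.EXECUTING,  # scheduler picks it
    (State.EXECUTING, "preempt"): State.EXECUTABLE,   # time-slice expires
    (State.EXECUTING, "block"): State.SUSPENDED,      # waits on a primitive
    (State.SUSPENDED, "wake"): State.EXECUTABLE,      # delay is over
}

def step(state: State, event: str) -> State:
    """Apply one event; reject transitions the model does not allow."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event} in {state.name}")

s = State.EXECUTABLE
for event in ("dispatch", "block", "wake", "dispatch", "preempt"):
    s = step(s, event)
print(s.name)  # EXECUTABLE
```

Note that a suspended process does not return directly to execution: waking up only makes it executable again, after which the scheduler must dispatch it.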

Classifying Computer Architectures: A popular method of classifying computer architectures, published by Flynn, considers the potential multiplicity of the instruction and data streams of a computer. The instruction stream is the sequence of instructions performed by a computer; the data stream is a sequence of data used in the execution of the instruction stream. Flynn's taxonomy identifies four classes of computers: SISD (Single Instruction stream, Single Data stream), the conventional uniprocessor; SIMD (Single Instruction stream, Multiple Data streams), such as array and vector processors; MISD (Multiple Instruction streams, Single Data stream), a largely theoretical class with few practical examples; and MIMD (Multiple Instruction streams, Multiple Data streams), the class into which most modern parallel and distributed systems fall.

Distributed and parallel processing: As a rather vague working definition, we shall consider a distributed and parallel processing system as one which involves the simultaneous operation of multiple interconnected processors.

Application Development Environments: ADEs also span a wide array of products and services. ADEs generally provide a high-level development language, and usually include tools that facilitate cross-platform applications by accommodating differences in operating environments and user interfaces. Deployment of applications may require additional services such as network communications, application partitioning and distribution services, component location services, management, and cross-platform deployment services. These services may be an integrated part of the ADE, or the ADE may rely on other middleware and communications products.

Object Development Environments: Object development environments are designed for the development of reusable software components. In a distributed environment, the components (objects) usually interact through an object request broker (ORB). When an application is distributed, the ORB handles the requests that one object makes of another object, and provides the mechanism for locating and interacting with objects across the network. ORBs can also interact with and rely on other forms of middleware for application communication and distributed services.

Data Access: Database management systems traditionally house and manage the consistent access to data for an application. However, distributed applications need to access data from numerous back-end sources, often running on different platforms. Data access products allow developers to view disparate data sources in a consistent way. The vast majority of business logic resides in the client application, and database middleware (data passing) is targeted at providing a solution for two-tier architectures where the dataflow across the network to and from a remote database server is in the form of statements.

MOM — Message Oriented Middleware: MOM is an enabling software layer residing between the business applications and the network infrastructure (or between the applications themselves, depending on the implementation) that supports high-performance interoperability of large-scale distributed applications in heterogeneous environments. It supports multiple communication protocols, languages, applications, and hardware and software platforms. MOM refers to the process of distributing data and control through the exchange of messages. It extends process-to-process communication in a distributed environment by providing message-passing or message-queuing models, supporting both synchronous and asynchronous communication. MOM lends itself to event-driven rather than procedural processing. Time-dependent and time-independent processing, as well as memory- and disk-based systems, are all available.
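The message-queuing model can be sketched as follows, with a thread-safe in-process queue standing in for the middleware's message store (an assumption for illustration; a real MOM product queues messages across machines and can persist them to disk). The "STOP" sentinel is likewise an illustrative convention.

```python
# Message queuing: the sender posts messages and continues (asynchronous);
# the receiver takes messages off the queue when it is ready.
import queue
import threading

mailbox: "queue.Queue[str]" = queue.Queue()
received = []

def consumer() -> None:
    """Drain messages until the shutdown sentinel arrives."""
    while True:
        msg = mailbox.get()
        if msg == "STOP":
            break
        received.append(msg)

t = threading.Thread(target=consumer)
t.start()

# The producer is decoupled in time from the consumer: it does not wait
# for each message to be processed before sending the next one.
for msg in ("order-1", "order-2", "order-3"):
    mailbox.put(msg)
mailbox.put("STOP")
t.join()

print(received)  # ['order-1', 'order-2', 'order-3']
```

The decoupling is the key property: the producer never blocks on the consumer's progress, which is what makes MOM suitable for time-independent, event-driven processing.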


Any enquiries concerning the material on these course pages should be directed to George Wells.

Last updated: 25 September 2011