*This chapter describes pipelines, trainable operations on data.*

- 11.1. Introduction
- 11.1.1. Execution on new data
- 11.1.2. Accessing pipeline steps
- 11.1.3. Displaying pipeline details
- 11.1.4. Untrained pipelines

# 11.1. Introduction ↩

In perClass, processing or transformation of data is described using the concept of a pipeline. Let us take, as an example, training of a linear classifier:

**>> load fruit**
**>> a**
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
**>> p=sdlinear(a)**
sequential pipeline 2x1 'Gaussian model+Normalization+Decision'
1 Gaussian model 2x3 single cov.mat.
2 Normalization 3x3
3 Decision 3x1 weighting, 3 classes

The object `p` is a pipeline comprised of three stages. The first is a
Gaussian model computed in the input 2D feature space. The model describes
three classes ('apple', 'banana' and 'stone') and therefore provides three
corresponding outputs (probability densities). The second stage is a
normalization turning the densities into posteriors. Finally, the third stage
converts the posteriors into a decision, providing a single integer output.

Pipelines in perClass are not limited to classification. They describe all types of data processing, including data scaling, feature extraction and selection.
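To make the pipeline concept concrete outside of MATLAB, here is a minimal Python sketch (not perClass code; all names are illustrative): a pipeline is a sequence of stages applied left to right, and overloading multiplication reproduces the `data * p` syntax.

```python
import numpy as np

class Pipeline:
    """Conceptual analogy of a perClass sequential pipeline (not the real API):
    a list of fitted stages executed one after another."""
    def __init__(self, steps):
        self.steps = steps  # list of callables, each mapping an array to an array

    def __rmul__(self, data):
        # mimics perClass's `data * p` syntax: run the stages left to right
        out = np.asarray(data)
        for step in self.steps:
            out = step(out)
        return out

# toy stages: linear class scores, normalization, and an argmax "decision"
W = np.array([[1.0, -1.0],
              [-0.5, 0.5]])                                # 2 features -> 2 class scores
scores   = lambda x: x @ W
softmax  = lambda s: np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)
decision = lambda post: post.argmax(axis=1) + 1            # 1-based class index

p = Pipeline([scores, softmax, decision])

# plain list, so Python dispatches to Pipeline.__rmul__
# (a NumPy array on the left would attempt elementwise multiplication first)
data = [[2.0, 0.0], [0.0, 2.0]]
print(data * p)  # -> [1 2]
```

The same object thus behaves like a function composition with a matrix-like calling convention, which is exactly the mental model the rest of this chapter builds on.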

## 11.1.1. Execution on new data ↩

The pipeline `p` may be applied to any data set with two features using the
multiplication operator `*`:

**>> data=sddata(data)**
5 by 2 sddata, class: 'unknown'
**>> out=data*p**
sdlab with 5 entries, 2 groups: 'apple'(3) 'stone'(2)

Output of our pipeline is an `sdlab` object with classifier decisions. In
perClass 4, all classifiers produce decisions by default.

The pipeline execution is analogous to matrix multiplication: our
pipeline `p` acts as a matrix with two rows (the feature inputs) and one
column (the decision output).

The multiplication operator is only syntactic sugar; the real work is done
by the `sdexe` function:

**>> sdexe(p,data)**
sdlab with 5 entries, 2 groups: 'apple'(3) 'stone'(2)

If we execute the pipeline on a raw data matrix, we obtain raw numerical output:

**>> data=rand(5,2)*100**
data =
41.4248 77.6399
36.8954 4.8470
85.0896 59.0271
79.7602 15.8238
35.0236 93.7622
**>> out=data*p**
out =
3
1
1
1
3

The mapping between integer decisions and decision names is handled by the pipeline's label list:

**>> p.list**
sdlist (3 entries)
ind name
1 apple
2 banana
3 stone

We may use the list object to convert decisions into names and vice versa:

**>> p.list(3)**
ans =
stone
**>> p.list('apple')**
ans =
1
**>> p.list(out)**
ans =
stone
apple
apple
apple
stone
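The two-way index-to-name conversion can be sketched in a few lines of Python (a hedged analogy of what `p.list` does, not the perClass implementation):

```python
class LabelList:
    """Conceptual sketch of a decision label list: maps 1-based integer
    decisions to class names and back, for scalars and vectors alike."""
    def __init__(self, names):
        self.names = list(names)                       # position i -> name (1-based)
        self.index = {n: i + 1 for i, n in enumerate(self.names)}

    def __call__(self, key):
        if isinstance(key, str):
            return self.index[key]                     # name -> integer decision
        if isinstance(key, int):
            return self.names[key - 1]                 # integer decision -> name
        return [self.names[k - 1] for k in key]        # decision vector -> names

fruit = LabelList(['apple', 'banana', 'stone'])
print(fruit(3))          # -> stone
print(fruit('apple'))    # -> 1
print(fruit([3, 1, 1]))  # -> ['stone', 'apple', 'apple']
```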

## 11.1.2. Accessing pipeline steps ↩

Unless specified explicitly, all pipeline operations refer to the last step:

**>> p**
sequential pipeline 2x1 'Gaussian model+Normalization+Decision'
1 Gaussian model 2x3 single cov.mat.
2 Normalization 3x3
3 Decision 3x1 weighting, 3 classes
**>> p.output**
ans =
decision

We may access individual pipeline steps using parentheses `()`:

**>> p(1).output**
ans =
probability density

Say we wish to extract the "soft outputs" of our classifier just before they are turned into decisions:

**>> p(1:2)**
sequential pipeline 2x3 'Gaussian model+Normalization'
1 Gaussian model 2x3 single cov.mat.
2 Normalization 3x3
**>> out=data*p(1:2)**
5 by 3 sddata, class: 'unknown'

The output is now a data set, because the second pipeline step returns real-valued output.

A quick shorthand for removing the decision step is the unary minus (`-`)
operator:

**>> -p**
sequential pipeline 2x3 'Gaussian model+Normalization'
1 Gaussian model 2x3 single cov.mat.
2 Normalization 3x3

We may, therefore, get classifier soft outputs using:

**>> data*-p**
5 by 3 sddata, class: 'unknown'

Applying the unary minus to a pipeline which already returns soft output has no effect:

**>> --p**
sequential pipeline 2x3 'Gaussian model+Normalization'
1 Gaussian model 2x3 single cov.mat.
2 Normalization 3x3
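The step-indexing and unary-minus behaviour can be mimicked in a short standalone Python sketch (illustrative only; note Python slices are 0-based where MATLAB's `p(1:2)` is 1-based):

```python
class Steps:
    """Sketch of sub-pipeline selection and the unary-minus shorthand:
    `-p` drops a trailing 'decision' step; applying it twice changes nothing."""
    def __init__(self, names):
        self.names = list(names)

    def __getitem__(self, i):          # p[0:2] is the analogue of p(1:2)
        sel = self.names[i]
        return Steps(sel if isinstance(sel, list) else [sel])

    def __neg__(self):                 # -p: remove the decision step if present
        if self.names and self.names[-1] == 'decision':
            return Steps(self.names[:-1])
        return self                    # idempotent: soft-output pipeline unchanged

    def __repr__(self):
        return '+'.join(self.names)

p = Steps(['gauss', 'norm', 'decision'])
print(p[0:2])   # -> gauss+norm
print(-p)       # -> gauss+norm
print(--p)      # -> gauss+norm  (no further effect)
```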

## 11.1.3. Displaying pipeline details ↩

Similarly to data sets and labels, perClass provides a quick shortcut for
displaying details about a pipeline: the transpose operator (`'`):

**>> p'**
sequential pipeline 2x1 'Gaussian model+Normalization+Decision'
1 Gaussian model 2x3 single cov.mat.
inlab: 'length','color'
lab: 'apple','banana','stone'
output: probability density
2 Normalization 3x3
inlab: 'apple','banana','stone'
lab: 'apple','banana','stone'
output: posterior
3 Decision 3x1 weighting, 3 classes
inlab: 'apple','banana','stone'
output: decision ('apple','banana','stone')

For each step, we can see the input/output labels and the type of
output. We can see that our pipeline `p` expects two input features,
namely 'length' and 'color'.

This information may be accessed using the pipeline fields `inlab`, `lab` and
`output`:

**>> p(1).inlab**
sdlab with 2 entries: 'length','color'
**>> p(3).output**
ans =
decision

## 11.1.4. Untrained pipelines ↩

Usually, we create pipelines by training them on a data set. However, in
some situations it may be beneficial to create a pipeline description
*without* a concrete data set. Such a pipeline is called *untrained*.

An untrained pipeline is created by passing an empty matrix `[]` as the first
argument.

The trained Parzen classifier:

**>> a**
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)
**>> p=sdparzen(a)**
.....sequential pipeline 2x1 'Parzen model+Decision'
1 Parzen model 2x3 260 prototypes, h=0.8
2 Decision 3x1 weighting, 3 classes

The untrained Parzen classifier:

**>> u=sdparzen([])**
untrained pipeline 'sdparzen'

By multiplying a data set with an untrained pipeline, we train it:

**>> p2=a*u**
.....sequential pipeline 2x1 'Parzen model+Decision'
1 Parzen model 2x3 260 prototypes, h=0.8
2 Decision 3x1 weighting, 3 classes

Note that the order is always `data * pipeline`.

Untrained pipelines are useful to separate the definition of a classifier from its training on data. We may provide any parameters when defining an untrained pipeline:

**>> u2=sdneural([],'units',20,'iters',1000)**
untrained pipeline 'sdneural'
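The idea of separating definition from training can be sketched in Python (a hedged analogy, not the perClass mechanism): an untrained pipeline stores only the recipe, and training happens when data arrives on the left of `*`.

```python
class Untrained:
    """Sketch of an untrained pipeline: holds the algorithm and its
    parameters, and builds a trained model only when `data * u` is evaluated."""
    def __init__(self, fit, **params):
        self.fit, self.params = fit, params

    def __rmul__(self, data):              # data * u triggers training
        return self.fit(data, **self.params)

# a toy "trainer": the trained model is just scaled per-feature means
def mean_model(data, scale=1.0):
    cols = len(data[0])
    return [scale * sum(row[c] for row in data) / len(data) for c in range(cols)]

u = Untrained(mean_model, scale=2.0)       # definition only, no data yet
trained = [[1.0, 2.0], [3.0, 4.0]] * u     # training happens here
print(trained)                             # -> [4.0, 6.0]
```

Because the recipe is a value, the same untrained object can be trained repeatedly on different subsets, which is precisely what cross-validation needs.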

Untrained pipelines are used, for example, by `sdcrossval` to perform
evaluation by cross-validation:

**>> sdcrossval(u,a)**
10 folds: [1: ....] [2: .....] [3: ....] [4: ....] [5: .....] [6: .....] [7: ....] [8: .....] [9: .....] [10: ....]
ans =
10-fold rotation
ind mean (std) measure
1 0.09 (0.02) mean error over classes, priors [0.3,0.3,0.3]
**>> sdcrossval(u2,a)**
10 folds: [1: ] [2: ] [3: ] [4: ] [5: ] [6: ] [7: ] [8: ] [9: ] [10: ]
ans =
10-fold rotation
ind mean (std) measure
1 0.08 (0.01) mean error over classes, priors [0.3,0.3,0.3]