To facilitate the channelization of potentially large numbers of channels, it is assumed that the proposed scheme is to be implemented in a *highly-parallel fashion* that makes efficient use of *pipelining* and single-instruction multiple-data (SIMD) *multiprocessing* techniques [1]. The target hardware is assumed, for purposes of illustration, to be a field-programmable gate array (FPGA) device [9], such that each input sample may be read into its buffer at the rate of one sample per clock cycle. Then, every N/4 clock cycles – referred to hereafter as the *update time* – the input data buffer to the PDFT module is updated with N/4 new samples and the polyphase filtering, together with all subsequent processing, is repeated.

This results in a *timing constraint* being imposed upon the operation of the proposed channelization scheme, in that the *time complexity* of each sub-system must be such that its outputs can be updated within just N/4 clock cycles (or some multiple thereof, in the case of the high-resolution FFT). This means that each of the sub-systems must be broken down into components which, when suitably pipelined and synchronised, enable this constraint to be met.

## 4.1 Double Buffers and Processing Elements

The technique of *double buffering* will need to be utilised in several places to ensure that pipelined operation of the channelization scheme is achieved. The basic idea is that whilst one buffer is being updated with new data, the data from the other is being processed by the appropriate processing elements (PEs). When the buffer is completely refreshed each time with new data this is quite a straightforward task. When overlapping of the data is involved, however, a more complex scheme is required. To achieve this, new data and data from the *buffer being processed* are fed simultaneously into the *buffer being updated* and the operation of the two buffers alternated with the availability of each new set of data – see Fig. 6 for the case of the complex-data FFT.
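
As a minimal software sketch of this alternating-buffer idea (with a small illustrative value of N rather than any particular hardware mapping), the buffer being updated can be filled from the newest three-quarters of the buffer being processed plus N/4 new samples, after which the roles of the two buffers are swapped:

```python
import numpy as np

N = 16                      # segment length (small, for illustration only)
hop = N // 4                # N/4 new samples arrive per update time

stream = np.arange(100)     # stand-in for the sampled input stream
buffers = [np.zeros(N), np.zeros(N)]
proc, upd = 0, 1            # indices of the two buffers

pos = 0
buffers[proc] = stream[pos:pos + N].astype(float)   # initial fill
pos += N

segments = [buffers[proc].copy()]
for _ in range(3):
    # Feed the updating buffer simultaneously from the buffer being
    # processed (the overlapped samples) and from the input stream.
    buffers[upd][:N - hop] = buffers[proc][hop:]
    buffers[upd][N - hop:] = stream[pos:pos + hop]
    pos += hop
    proc, upd = upd, proc                           # alternate the buffers
    segments.append(buffers[proc].copy())

# Consecutive segments overlap by 75%: each starts N/4 samples later.
```

Each successive processed segment shares 3N/4 samples with its predecessor, which is exactly the 75% overlap the scheme requires.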

For such operation to be achieved, however, each buffer must be *partitioned* into the form of a *memory bank* – there must be at least four memories per bank for consistency with the 75% overlapping requirement – with each memory typically made up from fast, dual-port random access memory (RAM) [9]. In this way, multiple samples may be read/written from/to the memory bank every clock cycle – with one read/write pair, or two reads/writes, from/to each memory – so that the data may be fed into the appropriate PEs at the rate required for processing to be maintained.

A description of how the pipelining might be achieved is given in Figs. 7 and 8, with two levels of granularity described for the PEs via the introduction of both *coarse-grain* processing elements (CPEs), as shown in Fig. 7, and *fine-grain* processing elements (FPEs), as shown in Fig. 8. An adequate timing margin will be sought, in each case, to allow for potential delays arising from the pipelined operation of the various interconnecting components.

## 4.2 Operation of Polyphase Filterbank

The components catered for by CPE1, as illustrated in Fig. 8, make up the polyphase filterbank, which may be implemented in the fashion of Jones [8] but with eight (rather than two) short, highly-parallel branch filters running in parallel – each typically requiring just four to six fast multipliers and executing within a single clock cycle – where each filter operates upon an appropriately defined subset of the input data stream that is contained within its own memory. This enables a new set of polyphase filtering outputs to be produced every

T_PFIR = N/8 + log2(L) (6)

clock cycles, where ‘L’ is the number of filter taps per branch – well within the allotted update time. A few stages of additions are required at the output of the short branch filters in order to complete the production of the filtered outputs. A double buffering scheme may be devised for the branch filters to ensure that whilst one buffer is being updated with new data, the data from the other is being fed into, and processed by, the branch filters. The operation of the two buffers – each comprising a bank of eight memories, one for each branch filter – would need to alternate every N/4 clock cycles.
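
The eight-branch decomposition can be sketched as follows, assuming a simple tap-interleaved split of the prototype filter (the indexing convention here is illustrative, not necessarily that of [8]); the cross-check shows that the branch outputs, once summed by the final addition stages, reproduce the full-length FIR inner product:

```python
import numpy as np

B, L = 8, 6                     # eight branches, L taps per branch
rng = np.random.default_rng(0)
h = rng.standard_normal(B * L)                   # prototype filter taps
branches = [h[k::B] for k in range(B)]           # eight short branch filters

x = np.random.default_rng(1).standard_normal(B * L)   # one window of data
# Each branch operates upon its own subset of the input stream, held in
# its own memory; here sub-stream k holds every eighth (time-reversed) tap.
subsets = [x[::-1][k::B] for k in range(B)]

# One filter output = eight L-point dot products (one per branch, all of
# which could run in parallel), followed by a few stages of additions.
branch_outs = [np.dot(branches[k], subsets[k]) for k in range(B)]
y = sum(branch_outs)

# Cross-check against the direct FIR inner product over all B*L taps.
y_direct = np.dot(h, x[::-1])
```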

## 4.3 Operation of Complex-Data FFT

The components catered for by CPE2, as illustrated in Fig. 8, make up the complex-data FFT, which may be implemented as a computational pipeline where each stage, for the case of a radix-4 FFT, carries out the execution of N/4 *butterflies* [12,13]. The availability of two highly-parallel radix-4 butterfly processors running in parallel – each typically requiring 16 fast multipliers and executing within a single clock cycle – for each of the log4(N) stages of the FFT will thus enable a new output set to be produced every

T_FFT = N/8 + ‘pipeline length’ (7)

clock cycles, again well within the allotted update time. The complex-data FFT is illustrated in Fig. 8 by means of a 1024-point radix-4 algorithm, so that five stages of butterflies are required for its pipelined computation. Double-buffered memory is placed between successive stages, with each buffer comprising a bank of eight memories, to ensure that the butterfly processors can operate at the required speed.
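
A single radix-4 butterfly of the kind each processor would execute can be sketched as below; the decimation-in-time form and twiddle placement are assumptions for illustration, and the result is checked against a direct four-point DFT:

```python
import numpy as np

def radix4_butterfly(a, w):
    """One radix-4 butterfly: 4 complex inputs `a`, 4 twiddles `w`."""
    b = a * w                                   # twiddle multiplications
    # 4-point DFT via additions and the trivial factor -j only.
    t0, t1 = b[0] + b[2], b[0] - b[2]
    t2, t3 = b[1] + b[3], -1j * (b[1] - b[3])
    return np.array([t0 + t2, t1 + t3, t0 - t2, t1 - t3])

rng = np.random.default_rng(2)
a = rng.standard_normal(4) + 1j * rng.standard_normal(4)
w = np.exp(-2j * np.pi * rng.integers(0, 4, size=4) / 4)  # sample twiddles
ref = np.fft.fft(a * w)                         # direct 4-point DFT check

# A 1024-point radix-4 FFT requires log4(1024) = 5 pipelined stages,
# each comprising N/4 = 256 such butterflies.
n_stages = round(np.log(1024) / np.log(4))
```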

It should be noted that the overlapping of the input data segments to the polyphase filters results in an additional timing constraint: for synchronisation of the two FFTs being simultaneously performed via the complex-data FFT – namely, that required by the PDFT module and that required by the row-DFT stage of the high-resolution FFT module – it is necessary that M×N windowed samples be made available to the FFT every M×N/4 clock cycles. Given that it takes M×N clock cycles to move this amount of data into the buffer, the data stored by the two buffers (as required for double buffering) must also be 75% overlapped.

This may be achieved with each buffer partitioned into a bank of four memories, as already seen in Fig. 6, and with the operation of the two buffers alternating every M×N/4 clock cycles. The entire set of M×N input samples must be available in the required buffer before the processing can begin, however, as each set of ‘N’ samples for input to a given large N-point FFT will come – according to the index mapping – from different parts of the data buffer. As a result of the overlapping of the input data, the high-resolution FFT spectra are produced at four times the rate obtained with contiguous data sets, so that delays in detecting changes to the signal environment may be kept to a minimum.
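
One plausible software sketch of such a four-memory bank (the actual layout of Fig. 6 may differ) overwrites the oldest quarter-segment on each update, so that three quarters of every segment are shared with its predecessor:

```python
import numpy as np

MN = 16                          # stands in for the M*N samples per segment
q = MN // 4                      # quarter-segment held by each memory
stream = np.arange(64)           # stand-in for the windowed sample stream

bank = [stream[i * q:(i + 1) * q].copy() for i in range(4)]
pos, oldest = MN, 0
segments = [np.concatenate(bank)]
for _ in range(3):
    bank[oldest] = stream[pos:pos + q].copy()   # refill the oldest memory
    pos += q
    oldest = (oldest + 1) % 4
    # Read the bank out in time order, starting from the (new) oldest memory.
    order = [(oldest + i) % 4 for i in range(4)]
    segments.append(np.concatenate([bank[i] for i in order]))

# Each segment shares 3*MN/4 samples with the previous one (75% overlap),
# so spectra appear four times as often as with contiguous segments.
```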

## 4.4 Operation of DDC Units

The components catered for by CPE3, as illustrated in Fig. 8, make up the DDC module, where the processing of each unit is carried out in three (or more) stages: the first stage performs the frequency shifting of the signals of interest, the second stage the application of the filtering coefficients, and the remaining stage(s) the summing of the results. Given that the channelization process reduces the sampling frequency out of the PDFT module by a factor of N/4, when compared to the initial system sampling frequency, the computational demands placed upon each DDC unit will not be great. In fact, for every N/4 new samples input to the PDFT module, there will be just one sample input to each DDC unit that has been assigned to a PDFT channel.
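
The three stages might be sketched, for one output sample of a hypothetical unit, as follows (the filter length, shift frequency, and coefficients are all illustrative placeholders):

```python
import numpy as np

L_ddc = 8                                        # assumed filter length
rng = np.random.default_rng(3)
coeffs = rng.standard_normal(L_ddc)              # assumed filter taps
x = rng.standard_normal(L_ddc) + 1j * rng.standard_normal(L_ddc)
f_shift = 0.1                                    # assumed normalised shift

n = np.arange(L_ddc)
stage1 = x * np.exp(-2j * np.pi * f_shift * n)   # 1: frequency shifting
stage2 = coeffs * stage1[::-1]                   # 2: apply filter coefficients
y = stage2.sum()                                 # 3: summing of the results
```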

The situation is further simplified by noting that the sampling frequency is consistent with the maximum signal bandwidth of interest, so that for the case where the signal bandwidth is significantly less than the channel bandwidth, a further reduction in the sampling frequency out of the DDC unit may be obtained.

## 4.5 Operation of Column-DFT Stage of High-Resolution FFT

The components catered for by CPE4, as illustrated in Fig. 8, make up the column-DFT stage of the high-resolution FFT, which involves the computation of ‘N’ short M-point FFTs. The entire set of M×N input samples – containing the appropriate unpacked outputs from the complex-data FFT – must be available in the required buffer before the processing can begin, however, as each set of ‘M’ samples for input to a given small M-point FFT will come – according to the index mapping – from different parts of the data buffer.

The short M-point FFTs may be straightforwardly pipelined, with a small number of pre-weave addition stages followed by a single nested point-wise multiplication stage, which is in turn followed by a small number of post-weave addition stages. A single, maximally parallel implementation of the M-point FFT, able to produce all ‘M’ outputs within a single clock cycle, would enable a new set of column-DFT outputs to be produced every

T_FFT-HR = N + ‘pipeline length’ (8)

clock cycles, which for M > 4 and N >> M is within the allotted update time (for the high-resolution FFT) of M×N/4 clock cycles. Double buffering is again used to enable one completed data set to be processed whilst the other is being updated, with each buffer partitioned into a bank of ‘M’ memories.
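
For the simple case M = 4, where the only non-trivial point-wise multiplier is -j, the pre-weave/multiply/post-weave structure of one short FFT can be sketched and checked against a direct FFT (larger M would use a Winograd-style factorisation); the column-DFT stage then applies one such FFT to each column:

```python
import numpy as np

def fft4_weave(x):
    """4-point FFT in pre-weave / point-wise multiply / post-weave form."""
    # Pre-weave addition stages.
    t0, t1 = x[0] + x[2], x[0] - x[2]
    t2, t3 = x[1] + x[3], x[1] - x[3]
    # Point-wise multiplication stage (only -j is non-trivial for M = 4).
    m = np.array([t0, t1, t2, -1j * t3])
    # Post-weave addition stages.
    return np.array([m[0] + m[2], m[1] + m[3], m[0] - m[2], m[1] - m[3]])

# The column-DFT stage: one short M-point FFT down each of the N columns
# of the M-by-N array of unpacked complex-data FFT outputs (tiny M, N here).
rng = np.random.default_rng(4)
grid = rng.standard_normal((4, 8)) + 1j * rng.standard_normal((4, 8))
col_dft = np.stack([fft4_weave(grid[:, c]) for c in range(8)], axis=1)
```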

## 4.6 Remaining Components

The remaining components, namely those of CPE5, carry out the additional tasks required by the SPE module: the centre frequency and bandwidth estimation of each signal of interest, followed by the selection of the channels of interest – namely, those PDFT channels that contain the signals of interest in their entirety. The algorithms required for these functions will not possess the regularity of those based upon the FFT and the FIR filter, although, in view of the comparatively low channel sampling frequency, one would not expect their computational demands to be prohibitive.
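
Since these algorithms are not specified here, the following is only an illustrative sketch of one possible approach: threshold the high-resolution magnitude spectrum and treat each contiguous run of above-threshold bins as one signal of interest, with the run's midpoint and extent serving as the centre frequency and bandwidth estimates (in bins):

```python
import numpy as np

# Toy magnitude spectrum with two signals of interest (values assumed).
mag = np.array([0.1, 0.1, 2.0, 3.0, 2.5, 0.1, 0.1, 1.8, 2.2, 0.1])
thresh = 1.0                                    # assumed detection threshold
above = mag > thresh

# Collect contiguous runs of above-threshold bins as (first, last) pairs.
signals = []
start = None
for i, a in enumerate(above):
    if a and start is None:
        start = i
    elif not a and start is not None:
        signals.append((start, i - 1))
        start = None
if start is not None:
    signals.append((start, len(mag) - 1))

# Centre frequency (bin midpoint) and bandwidth (bin count) per signal.
estimates = [((lo + hi) / 2, hi - lo + 1) for lo, hi in signals]
# Channels of interest would then be those PDFT channels whose passbands
# wholly contain each (centre, bandwidth) estimate.
```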