Kernel density estimation and correntropy-based background modeling and camera model parameter estimation for underwater video object detection

Underwater video object detection is challenging because of the complex background and the movement of the camera. To address this, we propose a novel scheme for simultaneously estimating the camera model parameters and detecting the object. The object detection phase comprises background modeling and model learning. The background is modeled by the proposed spatial kernel density estimation (SKDE) model, and model learning takes place in the SKDE feature space. Background modeling and learning follow a pixel-based approach: the model histograms learn each new pixel through its histogram representation. Our learning and classification strategy differs from the algorithm proposed by Heikkila et al. in 2006 in the choice of similarity measure; we have developed a correntropy-based similarity measure that is used for model learning and pixel classification. The camera model parameters are estimated by a 2D optimization method in which corner features of an object are extracted at subpixel accuracy. These subpixel-level features are used in a pipelining framework for model parameter estimation. The estimated model parameters are used to transform the input frame, which in turn is used for model learning and classification. The proposed scheme has been tested on underwater video frames from six datasets. Its efficacy is compared with seven existing schemes, and the proposed scheme is found to exhibit improved performance over the existing methods.


Introduction
Moving object detection in video sequences is one of the fundamental tasks for a number of video processing applications such as recognition, tracking, and understanding the behavior of moving objects. In order to detect the object of interest in a specific scene, the background of the scene has to be modeled. Because of complex environments, the task of building and updating a background model has become a major issue in the field of computer vision.
Many researchers Hao et al. (2013); Wintenby and Svensson (2015); Yang and Liu (2011); Stauffer and Grimson (1999); Chen et al. (2019) have been working on this challenging issue of detecting a moving object in a scene with a dynamic background. The typical methods for moving object detection are foreground extraction and background subtraction. Foreground extraction techniques classify pixels according to the changes in the incoming frames, while background subtraction suppresses the background by comparing an incoming frame to a background template. Broadly, background modeling techniques are classified as (i) parametric models and (ii) nonparametric models. In parametric modeling, the background is mostly modeled using either a single Gaussian distribution or a mixture of Gaussian distributions Stauffer and Grimson (1999); Migdal and Grimson (2005); Ge et al. (2016). Situations such as waves on water or trees shaken by the wind, where the surrounding pixel intensities tend to vary significantly over time, pose challenges. To overcome such issues, the probability distribution of the pixel intensities is estimated independently for each frame. Schemes that estimate the pixel intensity distribution directly from the data are categorized as nonparametric models. Here, the model Chen et al. (2019); Rashid and Thomas (2016) adapts to fast changes in the background process and detects a target with high sensitivity.

(Corresponding author: Susmita Panda, susmitapanda@soa.ac.in, Image and Video Analysis Lab, Department of Electronics and Communication Engineering, Institute of Technical Education and Research, Siksha 'O' Anusandhan (Deemed to be University), Bhubaneshwar, India)
The background modeling task becomes challenging in complex scenes, both on land and underwater. Underwater object detection represents one of the most challenging scenarios due to the presence of dynamic entities such as waves on the water surface, objects moving with variable speed, waves created by boats and uneven illumination in the scene. Owing to the physical properties of the underwater environment, such as light absorption, scattering, density and restricted human access, underwater visibility is greatly hampered. These conditions motivate this research work to address the problem of underwater object detection. As the traditional methods Bouwmans (2014); Jaffe (2015); Piccardi (2004); Xu et al. (2016) of moving object detection have proved not very effective for underwater environments, Liu et al. (2016) proposed an effective and reliable method to detect moving objects in underwater video by combining the notion of background subtraction with three-frame differencing followed by morphological processing. Similarly, for the underwater environment, Prabowo et al. (2017) developed a method to detect the object by subtracting the current frame pixels from the pixels of the previous frame's background model. The complexity of such an environment is further compounded if the video sequence is captured by a moving camera. For the moving-camera case, Stolkin et al. (2008) proposed an Expectation Maximization (EM)-based tracking algorithm for poor visibility conditions. Besides, Panda and Nanda (2015) addressed the incomplete-data problem while simultaneously estimating the camera position and image labels using the Expectation Maximization (EM) algorithm and an Extended Markov Random Field (E-MRF). Further, a new spatiotemporal Markov Random Field-based model has been proposed by
Panda and Nanda (2020) for simultaneously estimating the camera model parameters and detecting the object.
In this work, a new scheme is proposed for simultaneously estimating the camera model parameters and learning the background model. In the background modeling phase, we estimate the probability density of every pixel of a video frame using the Kernel Density Estimation (KDE) technique. Spatial neighborhood pixels are used for the density estimation, and hence this is known as the spatial KDE (SKDE) model. These estimated pixel densities of a frame are used as the features for background modeling and model learning. The histogram distribution of a window around a given pixel of the SKDE frame serves as the model for that pixel. For a given pixel, the histograms of the corresponding pixels in the temporal direction of the SKDE frames serve as its model histograms. The principle of model learning differs from that of Heikkila and Pietikainen (2006) in the proximity measure, which here is based on the correntropy measure. These model histograms of a given pixel learn the histogram of the corresponding pixel of the incoming frame. After learning, depending upon the degree of proximity, each pixel is classified as either a background pixel or a foreground pixel. In this process, all the pixels of the incoming frame are classified as either background or foreground, and thus the entire frame is classified to detect the object. The classified frames obtained by the above process are used for the estimation of the intrinsic and extrinsic parameters of the moving camera in the underwater environment. The camera parameters are estimated based on a 2D optimization method using the notion of pipelining. The estimated model parameters are used to transform the frames, which are subsequently used as the input frames for SKDE modeling and model learning. Thus, in this process, camera model parameter estimation and object detection are carried out simultaneously.
The proposed scheme has successfully been tested on underwater video frames of six datasets. The results obtained by the proposed scheme are compared with seven existing techniques and the novel algorithm exhibited improved performance in the context of different quantitative measures.
This paper is organized as follows. Section 2 deals with the related research work, while the proposed scheme is presented in Sect. 3. SKDE-based modeling and model learning are provided in Sect. 4, and the camera model parameter estimation is provided in Sect. 5. Results and discussion are presented in Sect. 6, while the concluding remarks are presented in Sect. 7.

Related work
Different classification schemes based on the notion of background subtraction (BS) have been proposed in the literature Bloisi et al. (2014); Xu et al. (2016); Bouwmans (2014). In the pixel-based approach Stauffer and Grimson (1999); Vemulapalli and Aravind (2009); Migdal and Grimson (2005); Ge et al. (2016), the change of each pixel in the temporal direction is considered an independent process. This method is used for real-time classification of moving objects. In the literature, authors have also considered region-based algorithms Mittal and Paragios (2004); Heikkila and Pietikainen (2006); Spampinato et al. (2014); Goyal and Singhai (2018), where a frame is divided into blocks and block-based features are used to detect the foreground. In a region-based algorithm, the histogram of each region is computed and edges are preserved while removing the noise. Liu et al. (2016), Singla (2014) and Zhang et al. (2012) have developed per-frame algorithms that can detect global changes in the scene, particularly in poor visibility conditions. Background modeling methods can also be grouped into a multi-stage category Toyama et al. (1999), where several steps are performed at different stages to improve the accuracy of the final result. Cheung and Kamath Sen-ching and Cheung (2004) proposed two types of BS methods, namely recursive and non-recursive. In a recursive algorithm Stauffer and Grimson (1999); Migdal and Grimson (2005), a single background model is recursively updated on each new incoming frame; here, the researchers have used Gaussian mixtures to model a single pixel. In the non-recursive approach Peng and WeiDong (2012), the authors maintain a buffer of previous frames and estimate the background model from a statistical analysis of the frames in the buffer. Background subtraction methods are also broadly categorized as predictive and non-predictive by Mittal and Paragios (2004).
Predictive algorithms Mittal and Paragios (2004) model the scene as a time series and develop a dynamic model to recover the current input based on past observation. Non-predictive methods Stauffer and Grimson (1999); Elgammal et al. (2000) neglect the order of input sequence and create a probabilistic representation of the observation at a particular level.
Further, in complex environments, backgrounds are modeled using statistical approaches to detect the foreground Bouwmans (2014). These statistical models can be categorized as parametric and nonparametric. In the parametric approach, the probability density function of the pixel process is represented parametrically using a prescribed statistical distribution. Parametric approaches Stauffer and Grimson (1999); Migdal and Grimson (2005); Ge et al. (2016) have limitations in handling dynamic conditions in an underwater environment. Alternatively, in the nonparametric approach, the density function is obtained directly from the pixels without any assumptions about the underlying distribution. Though this approach Qiao and Xi (2017); Kakizawa (2018); Miao et al. (2012); Elgammal et al. (2002); Giordano et al. (2014); Wintenby and Svensson (2015) is able to construct a statistical representation of the foreground or background, it is not able to learn all the changes of a dynamic background, especially the changes on the water surface. To handle such dynamic backgrounds, researchers have extended the temporal approach to spatiotemporal models, as presented in Vemulapalli and Aravind (2009). Álvarez Meza et al. (2016) have proposed an adaptive background model within an adaptive learning framework considering the spatiotemporal relationships among pixels. Recently, Maity et al. (2020) have attempted to minimize the effect of different video irregularities, such as dynamic background, changes in illumination and video noise, by a spatiotemporal region persistence (STRP) descriptor and an adaptive threshold. Further, to enhance the discriminative ability of the background model, Zhong et al. (2019) have proposed a dual-target nonparametric background model for classifying a pixel as either a static object or dynamic background.
Maritime backgrounds are more complex than other dynamic backgrounds since waves on the water surface do not belong to the foreground despite being in motion. Additionally, the problem is further compounded by poor illumination conditions. Srividhya and Ramya (2017) have extracted different statistical features, such as autocorrelation and sum of entropy, which are used by learning algorithms to classify underwater objects. To detect underwater moving objects, Liu et al. (2016) have proposed a detection scheme that combines the notions of background subtraction and three-frame differencing under the assumption of a fixed camera position. Similarly, the authors in Prabowo et al. (2017) have addressed an adaptive background modeling method to detect moving objects in underwater video. Vasamsetti et al. [39] have proposed a new feature descriptor, a multiframe triplet pattern, for underwater moving object detection. With the increasing demand for smartphone cameras, Sajid et al. (2019) have proposed a hybrid method that combines motion and appearance in an online framework for foreground/background segmentation of videos. Further, researchers Szolgay et al. (2011); Zamalieva and Yilmaz (2014) have also tried to detect moving objects in complex scenes captured by a moving camera.

Proposed scheme
The proposed scheme is shown in Fig. 1. As observed from Fig. 1, for continuous video object detection, the camera model parameter estimation and the background model learning alternate so as to simultaneously learn the background and detect the object. At a given time t, the input tth frame is transformed by the camera model parameters previously estimated for the (t−1)th frame. Thereafter, the spatial KDE (SKDE) of the transformed frame is computed. Background modeling and model learning are pixel-based processes. For a given pixel of the SKDE frame, the histogram of the window around the pixel contributes to the learning of the corresponding model histograms; in other words, the model histograms of the corresponding pixel are updated. Subsequently, the pixel is classified as either a background pixel or a foreground pixel. This process of learning and classification is repeated for all the pixels of a given frame, so that learning and classification take place for the entire frame. The classified frame at the tth time instant is used together with a few previously classified frames for parameter estimation.
Thus, as observed from Fig. 1, the object in a frame is detected by completing all of these processes once in an entire epoch of the algorithm.

Model initialization phase
Block 1 of Fig. 2 denotes the initialization phase, where a few model frames are selected. The model histograms are generated as follows. For any given pixel, a few SKDE frames in the temporal direction are chosen. Windows of a given size are constructed around the corresponding pixels of these frames, and the histograms of these windows serve as the model histograms for that pixel. For example, to have three model histograms for a given pixel of the tth SKDE frame, we consider the corresponding pixels of the (t−1)th, (t−2)th and (t−3)th frames and construct windows of a given size around them. The histograms of these windows serve as the background model histograms for the given pixel at the tth frame, and these model histograms are updated to learn the information from subsequent frames. Similarly, model histograms are generated for every pixel of the SKDE frame.
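As a concrete sketch of this initialization, the following Python builds the model histograms for one pixel from the windows of a few past SKDE frames. The window radius, bin count and value range here are illustrative assumptions, not the paper's actual settings:

```python
def window_hist(frame, i, j, r=1, bins=8, lo=0.0, hi=1.0):
    """Histogram of the (2r+1)x(2r+1) window around pixel (i, j) of an
    SKDE frame. Window radius r, bin count and value range [lo, hi]
    are assumed illustrative values."""
    h = [0] * bins
    rows, cols = len(frame), len(frame[0])
    for di in range(-r, r + 1):
        for dj in range(-r, r + 1):
            ni, nj = i + di, j + dj
            if 0 <= ni < rows and 0 <= nj < cols:
                # clamp the top edge of the range into the last bin
                b = min(int((frame[ni][nj] - lo) / (hi - lo) * bins), bins - 1)
                h[b] += 1
    return h

def init_models(skde_frames, i, j, n_models=3):
    """Model histograms for pixel (i, j): one window histogram per past
    SKDE frame, e.g. the (t-1)th, (t-2)th and (t-3)th frames."""
    return [window_hist(f, i, j) for f in skde_frames[-n_models:]]
```

The same initialization would be repeated for every pixel position of the SKDE frame.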

Background modeling and model learning
Model learning takes place in block 2 of Fig. 2. For learning of the tth frame, the entire frame is transformed by the camera model parameters estimated at the (t−1)th frame, as shown in block 7. Since learning is pixel based, for learning a given pixel of the tth transformed SKDE frame, a window of a given size is considered around the pixel and the histogram of this window is used for learning of the model histograms. For learning, the similarity of this histogram with each of the model histograms is computed, with the proposed correntropy measure between histograms serving as the similarity measure. Based on the correntropy value of each model histogram, a weight is assigned to the model histogram. The model histograms are then updated bin-wise based on the adaptation rule, and the weights of the model histograms are also updated. This update process constitutes the learning phase of the background model for the given pixel. After learning of a given pixel, classification of the pixel takes place. This process is repeated for all the pixels of the tth input frame to complete the model learning and the classification of the entire frame. This is shown in block 3 of Fig. 2.

Model parameter estimation
As seen in Fig. 2, the camera model parameter estimation phase follows the background learning phase. The classified tth frame thus obtained is fed to the camera model parameter estimation block, shown as block 4 of Fig. 2. The classified frame is pushed into a pipeline which has earlier been filled with the classified frames of the (t−1)th, (t−2)th, ..., (t−5)th frames. As shown inside block 4, the six classified frames in the pipeline are used for camera model parameter estimation based on the 2D optimization method. Both the intrinsic and extrinsic parameters are estimated. The extrinsic parameters are used to transform the next input frame, as shown in block 5. Thereafter, the SKDE of the frame is determined in block 7, and this SKDE frame is used for learning in block 2. These processes of learning, classification and parameter estimation continue for subsequent frames.
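The pipeline of classified frames can be sketched with a bounded queue: the newest classified frame enters, the oldest falls out, and the estimator runs only once all stages are occupied. The six-frame depth follows the description above, and `estimate_params` is a placeholder for the 2D optimization step, not the paper's actual routine:

```python
from collections import deque

# Bounded pipeline of classified frames used for parameter estimation.
pipeline = deque(maxlen=6)

def push_and_estimate(classified_frame, estimate_params):
    """Push the newest classified frame; run the (placeholder) estimator
    only when all pipeline stages are filled."""
    pipeline.append(classified_frame)
    if len(pipeline) < pipeline.maxlen:
        return None  # pipeline not yet full: estimates would be unreliable
    return estimate_params(list(pipeline))
```

With `deque(maxlen=6)`, the shift of older frames to later stages happens automatically on each append.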

KDE-based background modeling and model learning
Since the underwater background is complex because of poor visibility and dynamic conditions, we adopt an SKDE-based approach for background modeling. The KDE of a given pixel is computed using a set of data points. Usually in video frames, the KDE of a pixel is computed using the corresponding pixels in the temporal direction Elgammal et al. (2002). In our case, however, we evaluate the KDE of a given pixel over its spatial neighborhood pixels, and therefore, to differentiate it from the temporal variant, we denote it as the spatial KDE (SKDE) of the pixel. The SKDEs of all the pixels of a frame are evaluated, and the resulting frame is designated as the SKDE of that frame. This is intended to capture the local spatial attributes of the scene at a given pixel, which helps preserve the shape of the object while detecting the moving object across frames. Since learning is a pixel-based process, the model histograms learn the pixel through its histogram, which carries information about the given pixel and its neighborhood. The learning helps differentiate background and foreground pixels for classification. In this KDE-based background modeling, the spatial KDE (SKDE) of the video frames is computed and the histogram of the window around a given pixel of the SKDE frame is considered the model of the pixel. The background model histograms for a given pixel are given by the histograms of the same pixel in the temporal domain. In the learning phase, the correntropy measure is used as the similarity measure between the incoming histogram and the model histograms.
In the following, we present the KDE estimation process in the spatial domain and correntropy-based similarity measure.

Spatial kernel density estimation (SKDE)
KDE aims to produce a smooth, continuous estimate of a univariate or multivariate probability density using a positive kernel function K_σ(·) controlled by a bandwidth σ Elgammal et al. (2002). Given a sample set S = {X_m}, m = 1, ..., M, consisting of the pixel intensities in the neighborhood of a center pixel c, an estimate of the probability density function p̂_c at position c can be calculated as

$$\hat{p}_c = \frac{1}{M}\sum_{m=1}^{M} K_\sigma\left(X_c - X_m\right), \qquad (1)$$

where X_c denotes the intensity of the center pixel of the input frame X and M denotes the total number of pixels in the neighborhood of c.
Here, we choose the kernel function as a Gaussian function. The bandwidth acts as a smoothing parameter controlling the trade-off between bias and variance in the result: a large bandwidth yields high bias but low variance, while a small bandwidth yields low bias but high variance. Hence, the probability density estimated at the center point can be expressed as

$$\hat{p}_c = \frac{1}{M}\sum_{m=1}^{M} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(X_c - X_m)^2}{2\sigma^2}\right). \qquad (2)$$

Since we have considered only the spatial neighborhood pixels of the given pixel at site c, the KDE found at site c, i.e., p̂_c, is called the spatial KDE (SKDE). The spatial KDE maps the pixel intensity to a probability density value. This estimation is expected to remove false detections due to waves on the water surface or random noise caused by uneven illumination. In this work, the bandwidth of the Gaussian kernel is assumed to be constant for all the pixels of a frame. The SKDE of a pixel is computed as follows. A window of a given size is considered around the pixel, and the KDE of the pixel is computed using (2), where c denotes the center pixel, X_m denotes the neighborhood pixels and M denotes the number of neighborhood pixels in the window. This process is repeated for all the pixels of a given frame to obtain the SKDE of the frame. Similarly, the SKDE of every frame is computed, and these SKDE frames are the feature frames used for background modeling and model learning.
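The per-pixel computation of equation (2) can be sketched in pure Python as follows. The window radius and bandwidth are assumed illustrative values, and the center pixel is excluded from its own neighborhood (an assumption on our part, as the source does not specify this):

```python
import math

def skde(frame, r=1, sigma=10.0):
    """Spatial KDE of every pixel: equation (2) evaluated over the
    r-radius spatial neighbourhood window (r and sigma are assumed
    illustrative values)."""
    rows, cols = len(frame), len(frame[0])
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            xc = frame[i][j]
            total, m_count = 0.0, 0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    if di == 0 and dj == 0:
                        continue  # neighbourhood pixels only (assumption)
                    ni, nj = i + di, j + dj
                    if 0 <= ni < rows and 0 <= nj < cols:
                        d = xc - frame[ni][nj]
                        total += norm * math.exp(-d * d / (2.0 * sigma * sigma))
                        m_count += 1
            out[i][j] = total / m_count
    return out
```

On a perfectly flat region, every pixel's density equals the kernel's peak value, which is what makes locally homogeneous water-surface regions easy to model.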

Correntropy: A similarity measure
Correntropy is a nonlinear statistical measure of similarity between two random variables. It is a generalized correlation measure induced by a kernel function, a single measure that includes both the time structure and the statistical distribution, as stated in Zhao et al. (2011); Singh and Principe (2009); Liu et al. (2006); Santamaria et al. (2006). Liu et al. (2006) defined the correntropy between two arbitrary random variables Y_1 and Y_2 as

$$V_\sigma(Y_1, Y_2) = E\left[k_\sigma(Y_1 - Y_2)\right], \qquad (3)$$

where Y_1 = Y_{t1} and Y_2 = Y_{t2}. Thus, correntropy is an extension of the auto-correntropy of a random process. The name correntropy comes from the fact that its mean value is the argument of the logarithm of the quadratic Renyi's entropy of Y_1 − Y_2. It has a maximum value of 1/(√(2π)σ) at the origin and is a symmetric positive function. In our work, we use correntropy for measuring the similarity between the new input histogram distribution and the model histograms. The correntropy between the nth bins of two histograms is expressed as

$$V_\sigma(a_n, b_n) = E\left[k_\sigma(a_n - b_n)\right], \qquad (4)$$

where k_σ(·) is a positive definite kernel whose width is determined by the parameter σ, and a_n, b_n are the nth bins of the corresponding histograms. The correntropy values for all the bins of the histograms are computed, and the similarity measure between two histograms is the average of all the correntropy values. As only a finite number of samples is available, the following sample estimator is used for the expectation operator, as in Singh and Principe (2009):

$$\hat{V}_\sigma(a, b) = \frac{1}{N}\sum_{n=1}^{N} k_\sigma(a_n - b_n). \qquad (5)$$
Here N is the number of samples (histogram bins) within the kernel. We assume k_σ(·) to be a normalized Gaussian kernel with width σ. Hence, (5) can be expressed as

$$\hat{V}_\sigma(a, b) = \frac{1}{N}\sum_{n=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(a_n - b_n)^2}{2\sigma^2}\right). \qquad (6)$$

Thus, correntropy can be viewed as a generalized correlation function between two random variables, containing the higher-order moments of the error (a_n − b_n) between them. It measures the similarity between the two variables within a small neighborhood determined by the kernel width σ.
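A minimal Python sketch of this sample estimator applied to two histograms (the kernel width sigma is an assumed illustrative value):

```python
import math

def correntropy_similarity(h1, h2, sigma=0.1):
    """Sample correntropy between two histograms: the normalized
    Gaussian kernel of each bin-wise error a_n - b_n, averaged over all
    N bins (sigma is an assumed illustrative value)."""
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    n_bins = len(h1)
    total = sum(norm * math.exp(-(a - b) ** 2 / (2.0 * sigma ** 2))
                for a, b in zip(h1, h2))
    return total / n_bins
```

Identical histograms attain the maximum value 1/(√(2π)σ); any bin-wise disagreement pulls the average below that peak, which is the property the classification threshold exploits.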

Background modeling
In this work, background modeling is carried out in the feature space instead of the raw data space. A few initial frames are chosen and the spatial KDE (SKDE) of these frames is computed; the SKDE frames are considered the feature frames. Modeling and model learning are pixel-based. For modeling a given pixel of the SKDE frame, a window is constructed around the pixel and the histogram of the window serves as the model of the pixel. To obtain the model histograms of a pixel, the corresponding pixels in the temporal direction are considered, and the histograms of these temporal pixels serve as the model histograms. The number of model histograms may vary, but for the sake of illustration, three such model histograms are shown in Fig. 3; these three histograms are considered the model histograms for the pixel. In the learning phase, the corresponding model histograms are updated for each input pixel of the new input frame. The update of these model histograms together with the update of the weights is known as model learning.

KDE-based model learning
The model learning of feature frames is a pixel-based process. With the background model histograms of the SKDE frames of Fig. 3, model learning takes place with every new input pixel of the new frame. For example, for background modeling of a pixel at the tth time instant, the histograms of the corresponding pixels of a few past frames, i.e., the (t−1)th, (t−2)th and (t−3)th frames, are considered as the model histograms, as shown in Fig. 3. The highest correntropy between the new input histogram and any of the model histograms indicates the best match between the new histogram and the model histograms. If the correntropy value of the input histogram with any of the model histograms is above a preselected threshold T_p, then the pixel is considered a background pixel; otherwise it is a foreground pixel. If the correntropy value is below the threshold T_p for all the model histograms, then the model histogram with the lowest weight is replaced by H_n, and this replacement is assigned the lowest weight. Otherwise, the best-match model histogram is updated with the new input histogram by the following bin update rule:

$$\hat{H}_{ok} = \alpha_1 H_n + (1 - \alpha_1) H_{ok}. \qquad (7)$$
Here Ĥ_ok is the updated model histogram, H_ok is the best-match model histogram, H_n is the histogram of the new frame presented for learning, and α_1 is the learning parameter.
The weights of the model histograms are updated as follows:

$$\hat{w}_k = \alpha_2 H_k + (1 - \alpha_2) w_k, \qquad (8)$$

where α_2 is a user-defined parameter and H_k is unity for the best-matching histogram and 0 for the others. Thereafter, the model histograms are sorted in decreasing order of their weights, and the first N histograms are selected as the background model histograms based on the following condition:
$$\sum_{k=1}^{N} \hat{w}_k > T_B, \qquad (9)$$

where T_B is a user-defined parameter. The above process is repeated for all the pixels of the input frame, and the pixels are classified; in this process, the model histograms for all the pixels of the new SKDE frame are updated. Here the similarity measure is the correntropy measure. If the correntropy of the input histogram H_n with any of the model histograms is above the threshold T_p, the pixel is classified as background; otherwise, i.e., if the correntropy of H_n with all the model histograms is below T_p, the pixel is classified as foreground (object region). Thus, by classifying all the pixels of a given frame, the object in that frame is detected. The flowchart of the model learning and classification algorithm is given in Fig. 4. The salient steps of the algorithm are enumerated below.
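One pixel's learning-and-classification step can be sketched as follows. The threshold T_p and the learning rates alpha1, alpha2 are assumed illustrative values, and `similarity` stands in for the correntropy measure:

```python
def learn_and_classify(models, weights, h_new, similarity,
                       T_p=0.7, alpha1=0.05, alpha2=0.05):
    """One pixel's learning step (sketch): bin-wise adaptation of the
    best-matching model histogram, weight update, and classification.
    T_p, alpha1, alpha2 are assumed illustrative values."""
    sims = [similarity(m, h_new) for m in models]
    best = max(range(len(models)), key=lambda k: sims[k])
    if sims[best] >= T_p:
        # background: adapt the best-matching model histogram bin-wise
        models[best] = [alpha1 * hn + (1.0 - alpha1) * hm
                        for hn, hm in zip(h_new, models[best])]
        # weight update: the match indicator is 1 for best, 0 otherwise
        weights[:] = [alpha2 * (1.0 if k == best else 0.0)
                      + (1.0 - alpha2) * w
                      for k, w in enumerate(weights)]
        return "background"
    # foreground: no match, so the lowest-weight model histogram is
    # replaced by the new histogram (which keeps that lowest weight)
    worst = min(range(len(weights)), key=lambda k: weights[k])
    models[worst] = list(h_new)
    return "foreground"
```

In practice `similarity` would be the correntropy-based measure; any bounded similarity with a matching threshold exercises the same control flow.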

Camera model
Accurate estimation of the camera model parameters leads to accurate detection of the video object. We estimate the camera parameters using the notion of a pipeline, as shown in Fig. 5. After the classification of the input frames, features are extracted from these frames for parameter estimation. The accuracy of the estimation depends upon the proper choice of features from different views. In order to obtain features corresponding to the shape of an underwater moving object, the improved Harris corner detection algorithm Qiao et al. (2013) has been used to extract the features for parameter estimation.
We use five stages in the pipeline to estimate the camera parameters; hence, at a given time, features of five views of a given video are used to obtain the parameter estimates. The process of parameter estimation is as follows. Initially, all the pipeline stages are empty. As shown in Fig. 5, at time slot T = t − 4, the corner features of the first view enter the pipeline while the remaining four stages hold null features; estimating parameters with only these features would result in inaccurate estimates. Thereafter, the features corresponding to view 1 (the first frame) are shifted to the next stage, enabling the features of view 2 to occupy the first stage. This process continues, and at time T = t all the pipeline stages are filled with the features of the respective frames, so that the parameters estimated from these features are expected to be correct. The camera model parameters are estimated based on the 2D optimization method proposed by Zhou et al. (2012) and Zhang (2000). With the parameter vector θ = (f_x, f_y, u_0, v_0, R, t)^T, the objective function minimizes the distance between the estimated image point î_u and the distorted image point î_u^d, where î_u denotes the feature point mapped to the image coordinate system and î_u^d is the corresponding distorted image point. Here f_x, f_y denote the respective focal lengths, u_0, v_0 denote the principal point coordinates, and R and t denote the rotation matrix and the translation vector, respectively. The objective function proposed by Zhou et al. (2012) is expressed as

$$J(\theta) = \sum_{i} \left\| \hat{i}_{u,i} - \hat{i}^{\,d}_{u,i} \right\|^2. \qquad (10)$$

In this research, the image frames are taken from six datasets. For the sake of illustration, Fig. 6 shows the Harris corner features of a whale in a frame.
First, these corner points are mapped into the camera coordinate plane and then into the image coordinate plane. In our work, lens distortion has not been taken into account; hence, the distance between the estimated image point î_u in the image coordinate plane and the real image point i_u is minimized. The estimated point î_u is a function of both the intrinsic parameters (f_x, f_y, u_0, v_0) and the extrinsic parameters R and t, i.e., î_u = f(f_x, f_y, u_0, v_0, R, t). The parameter vector θ is obtained by minimizing

$$\theta^{*} = \arg\min_{\theta} \sum_{i} \left\| i_{u,i} - \hat{i}_{u,i}(\theta) \right\|^2. \qquad (11)$$

Since the camera is in motion, the new input frame is transformed by the estimated extrinsic parameter matrix consisting of the rotation angle θ and the translation parameters t_x, t_y, t_z. As the segmented frame and the estimated camera parameters are interrelated, the transformed input frame is then subjected to spatial KDE. The block diagram of the camera calibration process is presented in Fig. 7.
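To illustrate the least-squares principle behind this objective, consider a toy instance in which θ is reduced to a pure 2D translation t; the minimizer then has a closed form, t = mean(i_u − p). This only illustrates the reprojection-error idea; the full problem over (f_x, f_y, u_0, v_0, R, t) needs an iterative optimizer:

```python
def estimate_translation(observed, projected):
    """Toy reduction of the reprojection objective: with only a 2-D
    translation t as parameter, argmin_t sum ||i_u - (p + t)||^2 has the
    closed form t = mean(i_u - p). Illustrative only; not the paper's
    full 2D optimization."""
    n = len(observed)
    tx = sum(o[0] - p[0] for o, p in zip(observed, projected)) / n
    ty = sum(o[1] - p[1] for o, p in zip(observed, projected)) / n
    return tx, ty
```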

Camera model parameter estimation
It is known that a proper choice of feature points contributes predominantly to the accurate estimation of parameters. In the underwater environment, it may be difficult to extract appropriate feature points. To ameliorate this issue, steerable pyramid filters with different angles are used for different frames. Steerable filters are used to obtain features of a given frame at different orientations. The filter bank is recursive in nature, and the mth of the k directional bandpass filters can be expressed as

$$B_m(S, \theta) = H(S)\, G_m(\theta), \qquad m = 0, \ldots, k-1, \qquad (12)$$

where

$$G_m(\theta) \propto \left[\cos\!\left(\theta - \frac{\pi m}{k}\right)\right]^{k-1}, \qquad (13)$$

S is the radial variable in frequency space, θ = tan⁻¹(v/u) is the angular variable in frequency space, and H(S) is a raised-cosine high-pass transfer function.
The kernels at different angles are applied to the considered frames for feature extraction. By exposing different surfaces of the object, the steerable pyramid filters help extract the proper feature points. It has been reported in Qiao et al. (2013) that corner points serve as the feature points in a checkerboard image. In our case, the underwater video objects are from six datasets, and the Harris corner detection algorithm is the choice for detecting corner points. Though the Harris operator can take care of image rotation, gray-level change and noise interference, it is limited to detecting corners at the coordinates of pixel points only. From a practical standpoint, accurate corners may lie at subpixel coordinate positions rather than at pixel coordinates; hence, in our case, the accurate corners of the underwater object may correspond to subpixel accuracy. This motivated us to adopt the improved Harris corner detection algorithm with subpixel accuracy Qiao et al. (2013) in the different video frames. For a given frame, the feature points are weighted to take care of orientation and movement; we assign different weightages to the sets of feature points of different frames in order to account for the movements across frames. These weighted features are mapped to the coordinate frame. Following are the salient steps of the camera parameter estimation as presented in Fig. 6:
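The subpixel-accuracy idea can be illustrated with a simple parabolic refinement of the corner response around an integer maximum. This is a generic technique sketched for illustration; the improved detector of Qiao et al. (2013) may differ in detail:

```python
# Fit a parabola through three corner-response samples around an integer
# maximum and take its vertex, yielding a corner position that can land
# between pixel coordinates.

def subpixel_offset(r_left, r_center, r_right):
    """Vertex of the parabola through three samples, within [-0.5, 0.5]."""
    denom = r_left - 2.0 * r_center + r_right
    if denom == 0:
        return 0.0
    return 0.5 * (r_left - r_right) / denom

def refine_corner(response, x, y):
    """Refine the integer corner (x, y) to subpixel accuracy per axis."""
    dx = subpixel_offset(response[y][x - 1], response[y][x], response[y][x + 1])
    dy = subpixel_offset(response[y - 1][x], response[y][x], response[y + 1][x])
    return (x + dx, y + dy)
```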

Combined algorithm for object detection and model parameter estimation
The object detection and camera model parameter estimation are carried out once in each epoch of the combined algorithm.

Algorithm 2 Parameter estimation algorithm
Input: Feature points of the segmented frame.
Output: Estimated intrinsic and extrinsic camera parameters.
1. Transform the feature points I_w in the world coordinate system to the camera coordinate system using the extrinsic parameters R and t; denote these points as I_c.
2. Project the camera coordinate points I_c using the intrinsic parameters (f_x, f_y, u_0, v_0); denote the projected points as î_u.
3. Compute the distance between î_u and i_u for every image point and minimize the objective function

min_θ Σ_j || i_u,j − î_u,j(θ) ||²

using the Levenberg-Marquardt algorithm.
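The Levenberg-Marquardt step of the algorithm can be illustrated on a toy one-parameter problem (fitting a single focal length f in a 1-D projection u = f·x). This is illustrative only; the actual minimization runs over the full parameter vector θ:

```python
# Damped Gauss-Newton (Levenberg-Marquardt) iteration for one parameter.
# lam is the damping factor: small lam gives Gauss-Newton-like steps,
# large lam gives small gradient-descent-like steps.

def lm_fit_focal(xs, us, f0=1.0, lam=1e-3, iters=50):
    f = f0
    for _ in range(iters):
        residuals = [u - f * x for x, u in zip(xs, us)]
        cost = sum(r * r for r in residuals)
        jtj = sum(x * x for x in xs)                      # J^T J (scalar)
        jtr = sum(-x * r for x, r in zip(xs, residuals))  # J^T r, dr/df = -x
        step = -jtr / (jtj + lam)                         # damped normal equation
        new_cost = sum((u - (f + step) * x) ** 2 for x, u in zip(xs, us))
        if new_cost < cost:
            f += step
            lam *= 0.5    # step accepted: trust the local model more
        else:
            lam *= 10.0   # step rejected: increase damping
    return f
```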
One epoch of the combined algorithm detects the object in one frame. Thus, for continuous object detection and parameter estimation, the combined algorithm corresponding to Figs. 1 and 2 is executed repeatedly. The flowchart of the combined algorithm is presented in Fig. 8, and its salient steps are enumerated below.

Results and discussion
We have considered different views from six datasets, namely the Creepy chimara / Nautilus live video Anon (2016), the Blainville's beaked whale dataset Anon (2015a), the Whalesharks in the Philippines dataset, and three others.

The intrinsic and extrinsic camera model parameters are estimated using the proposed weighted corner features of the previously classified frames. The weighted features of five previous frames occupy the pipeline stages used to estimate the model parameters. There are five pipeline stages, as shown in Fig. 5, and all of them are necessary for estimating the parameters with minimum error. Hence, at a given point of time, features of five views of a given dataset are used to obtain the estimated parameters. For example, in the case of dataset 1, the features of the classified frames numbered 24, 25, 26, 27 and 28 are pushed into the pipeline to estimate the camera parameters of the 28th frame. Similarly, to estimate the camera parameters of frame 31, features of the classified frames numbered 27, 28, 29, 30 and 31 are pushed into the pipeline. In the case of dataset 2, the camera parameters of the 16th frame are estimated by pushing frames 12, 13, 14, 15 and 16 into the pipeline, and those of the 18th frame by pushing the features of frames 14, 15, 16, 17 and 18. The same process is repeated for datasets 3, 4, 5 and 6.

Table 1 presents the estimated intrinsic parameters for the different datasets; these correspond to the optical centers and focal lengths of the moving camera. In the parameter estimation step, the parameter vector θ, consisting of both intrinsic and extrinsic parameters, is estimated using the parameter estimation algorithm of Sect. 5. The parameters are obtained from the features of the five classified frames in the pipeline and are estimated for each frame of the video.
Since different cameras were used for different datasets, the intrinsic parameters differ from each other. To test the efficacy of the parameter estimation strategy, two different views of the same dataset (Dataset 4) have been considered. As observed from the 4th and 5th columns of Table 1, the estimated intrinsic parameters are close to each other, as expected for frames taken from two views of the same dataset. The camera calibration error is the difference between the actual image point and its estimated position.
The accuracy of the estimated parameters depends upon the calibration error: the smaller the calibration error, the better the accuracy of estimation. The calibration errors for the different datasets are provided in Tables 2 and 3, where it may be observed that they are of low value. Further, as observed from Table 2, in the case of dataset 1 the calibration error for frame 31 is less than that of frame 28. In the case of the 2nd dataset, the calibration error of the 18th frame is less than that of the 16th frame. Similar observations are made for datasets 3, 4, 5 and 6, as presented in Tables 2 and 3. Hence, the estimated parameters, having low calibration errors, are acceptable.

Figs. 9 to 11 show the segmented results of different frames from different datasets. As observed from the original frames, there is a single moving object in a dynamic and unevenly lighted background. The results obtained by our proposed algorithm are compared with those of Stolkin et al., Prabowo et al., Liu et al., Elgammal et al. and others. Figs. 9a, 10a and 11a correspond to the original frames, while Figs. 9b, 10b and 11b show the corresponding ground-truth frames. Figs. 9c-i, 10c-i and 11c-i show the results obtained by the different algorithms, while Figs. 9j, 10j and 11j show the results obtained by the proposed algorithm. Visual inspection reveals different degrees of misclassification in the object and background portions; in some cases, many object portions are not detected properly. As observed from Figs. 9j, 10j and 11j, however, the proposed algorithm detects the object with a minimum amount of misclassification error, and the background is also detected properly. In some cases the shape of the object is retained, but because of false positives the detected object appears slightly different in size from the original object.
The first two quantitative measures considered are Recall and Precision, defined as

Recall = TP / (TP + FN),  Precision = TP / (TP + FP),

where TP is true positive, FP is false positive and FN is false negative.
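These counts and the first two measures can be computed from binary segmentation masks as follows (an illustrative sketch, with 1 denoting foreground):

```python
# TP/FP/FN counts between a segmented mask and its ground truth,
# both given as 2-D lists of 0/1 values.

def confusion_counts(segmented, ground_truth):
    tp = fp = fn = 0
    for s_row, g_row in zip(segmented, ground_truth):
        for s, g in zip(s_row, g_row):
            if s and g:
                tp += 1            # foreground detected correctly
            elif s and not g:
                fp += 1            # background labeled as foreground
            elif g and not s:
                fn += 1            # foreground missed
    return tp, fp, fn

def recall(tp, fp, fn):
    return tp / (tp + fn)

def precision(tp, fp, fn):
    return tp / (tp + fp)
```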

In the tables, boldfaced data correspond to the best results.
The fourth quantitative measure considered is the Dice coefficient, defined as

Dice = 2 |S_FG ∩ GT_FG| / (|S_FG| + |GT_FG|),

where S denotes the segmented image, GT denotes the ground truth, and FG and BG correspond to the foreground and background, respectively. The last quantitative measure is the F-measure, defined as

F-measure = 2 · Precision · Recall / (Precision + Recall).

For the 16th and 18th frames of dataset 2, the values of Recall, Dice coefficient and F-measure for the proposed algorithm are the highest among all, whereas the Precision value is higher than those of two existing algorithms but lower than those of the other four. This is attributed to false positives in the object portions. As seen from Table 4, for the 16th and 20th frames of dataset 3, the Recall, Dice coefficient and F-measure values are the highest among all the algorithms considered. The Precision for the 16th frame is 84.9%, which is higher than those of three algorithms, comparable to one, and lower than those of two. Hence, in this case, both Precision and Recall are high, indicating that the object has been detected. Similar observations are made for the 20th frame. Visual inspection of the results of Fig. 9 also reveals that there is almost no change in the size of the detected object.
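In terms of the TP/FP/FN counts, these two measures can be sketched as below (for binary foreground/background masks the Dice coefficient and the F-measure coincide, which the sketch makes explicit):

```python
# Dice coefficient and F-measure from the confusion counts of a binary
# foreground segmentation. |S_FG ∩ GT_FG| = TP, |S_FG| = TP + FP,
# |GT_FG| = TP + FN.

def dice(tp, fp, fn):
    return 2.0 * tp / (2.0 * tp + fp + fn)

def f_measure(tp, fp, fn):
    p = tp / (tp + fp)   # Precision
    r = tp / (tp + fn)   # Recall
    return 2.0 * p * r / (p + r)
```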
For the four different frames of dataset 4, as observed from Table 4, the Recall values are the highest, indicating that the object is detected. The Dice coefficient and F-measure values are also high and comparable to the others, but the Precision is lower than that of the other algorithms; this observation is similar to those for datasets 1 and 2. Table 5 presents the quantitative measures for the 5th and 6th datasets. In all four of these frames, the Recall values are the highest among all the algorithms. For the 46th and 156th frames of the 5th dataset, the Dice coefficient values are the highest, indicating the accuracy of the detected objects, and the F-measure values for the proposed algorithm are also the highest for these frames. Similar observations are made for the 265th and 388th frames of the 6th dataset.

Further, the average quantitative measures for all the datasets are presented in Table 6. We have considered 15 or more frames per dataset to determine the averages, and the number of frames considered for each dataset is given in Table 6. As observed from Table 6, the Recall values for the proposed algorithm are the highest among all the algorithms, and the F-measure and Dice coefficient values are the highest in five cases. Thus, in all the cases, the proposed algorithm exhibits improved performance as compared to the other algorithms.

The programs for all the algorithms have been developed by us in C, and the algorithms are run on a machine with an Intel® Core i3-3217U CPU @ 1.80 GHz ×4, 4 GB RAM and 500 GB storage. The execution time for each dataset is presented in Table 7. As observed from Table 7, the execution time of the combined algorithm for a frame of the 3rd dataset is 31 s, which is the minimum; further, the execution time increases with the size of the frame.
The execution time is expected to decrease further on a machine with enhanced computing features. Hence, the proposed algorithm detects the underwater object under poor visibility and dynamic background conditions with a moving camera.

Conclusions
In this work, attempts have been made to detect a moving underwater object when the camera is moving in the same environment. Our proposed scheme therefore estimates the camera model parameters, which are subsequently used for object detection. The object detection phase is based on background modeling and its learning. Background modeling and model learning are carried out in the SKDE feature space to deal with the complexity of the background; the SKDE of each frame is computed and used as the feature frame. Background modeling and model learning follow a pixel-based approach in which each pixel of the SKDE frame is modeled by its histograms. In the learning phase, our correntropy-based similarity measure determines the proximity of the histogram of a pixel of the incoming frame to the model histograms. It is observed that the proposed modeling and learning strategy can take care of the complex background of the underwater environment. The estimated camera model parameters are used to transform the input frame before it is presented for learning; accordingly, learning and classification depend upon the accuracy of these estimates, which in turn depends on the proper choice of features. The proposed pipelining approach with the improved Harris corner detection method results in good parameter estimates and thereby proper transformation of the frames for learning. The accuracy of the parameters was assessed through the calibration error, and the estimated intrinsic parameters are found to be close to the available camera parameters. The proposed scheme has been successfully tested on six underwater datasets, and the results obtained are superior to those of the existing methods, both qualitatively and quantitatively.
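A generic correntropy-style similarity between two normalized histograms can be sketched as a mean Gaussian-kernel agreement over corresponding bins. This is an illustrative form only; the exact kernel and bandwidth used in the proposed measure may differ:

```python
import math

# Correntropy-style similarity between two histograms h1 and h2 of equal
# length: 1.0 for identical histograms, decaying toward 0 as bins diverge.
# sigma is the Gaussian kernel bandwidth (here chosen by hand, mirroring
# the trial-and-error choice of sigma noted in the text).

def correntropy_similarity(h1, h2, sigma=0.1):
    return sum(math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))
               for a, b in zip(h1, h2)) / len(h1)
```

A pixel's histogram would be matched against each model histogram with this score, the best match deciding learning and background/foreground classification.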
In the camera model, we have not considered lens distortion; hence, the pinhole camera model can be modified to take care of lens distortion when modeling a real lens. The proposed algorithm's performance deteriorates for partially occluded, or in other words partially submerged, objects: these lead to an inadequate number of feature points and, in turn, inaccurate parameter estimates, which degrade the performance of the algorithm. In the SKDE modeling, the Gaussian kernel bandwidth σ is chosen by trial and error. Future work includes taking care of the lens distortion factor and estimating the kernel bandwidth. Further, devising a novel scheme for detecting objects with an optimum number of feature points, even when they are partially submerged, is worth pursuing. Thus, the proposed scheme may be used in machine vision systems for underwater object detection.
Author contributions Susmita Panda (1st author) was involved in the formulation and development of the algorithm, validation of the algorithms on different datasets, and manuscript preparation. Pradipta Kumar Nanda (2nd author) was involved in the conceptualization and formulation of the problem, validation of the results, and manuscript preparation.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.