Optimal Subsampling Methods in Bike Sharing Data Analysis

Bike sharing systems are popular around the world. Traditional bike sharing systems require bikes to be returned to fixed stations, while modern systems allow users to leave bikes wherever they like, ready for the next user to pick up. Smartphones use GPS signals to keep track of the bikes and to monitor where most bikes are used and where they should be placed. Smartphones simultaneously collect other information, such as weather conditions and temperature, and these features influence the number of bikes delivered. Because of the large number of smartphone users, big data techniques are required to handle the resulting data. We apply a subsampling method to these smartphone-collected data. In this paper, we derive non-uniform sampling distributions and propose an optimal subsampling algorithm. We apply the proposed algorithm to a smartphone-collected bike sharing data set and perform extensive computer experiments to evaluate its numerical performance. Our results indicate that the proposed optimal algorithm outperforms the uniform method and runs faster than an analysis of the whole data set.


Introduction
With the universal spread of the smartphone, it has become more convenient to collect users' data, and big data analysis is inevitably required to deal with smartphone-collected data. A typical example is bike sharing data: people use a mobile app to access the bike sharing system and consequently generate a huge amount of data to process. Our goal is to use the smartphone-collected information to build a model that predicts the number of shared bikes delivered at different locations. Because the response variable is a nonnegative integer, we refer to this kind of data as count data. Count data are observations of the number of occurrences of a behavior in a fixed period of time, for example, hospital visits, blog comments, car or bike renters, and questionnaire respondents. Analysis of count data is an important task in the social sciences and economics. Since linear regression does not take into account the restricted range of count response values, it is not an appropriate technique for count data. A generalized linear model (GLM) is a suitable choice for count data.
Meanwhile, these smartphone-collected data satisfy the definition of big data. Big data are on a massive scale with regard to volume, velocity, variety, and veracity, exceeding both the capacity of conventional software tools and operating systems and the physical storage of computers, see e.g. Fan et al. [1]. The smartphone-collected data share two inherent characteristics of big data: (1) the data are too large for a desktop to store, and (2) the computing task takes too long to accomplish. The divide-and-conquer approach easily solves the first problem, see e.g. Bühlmann et al. [2]. The subsampling approach solves both of them simultaneously. In this paper, our focus is to choose a subsample as a surrogate for the full data and to use a GLM to complete the data analysis. Uniform sampling is often used to take the subsample, since it is mathematically simple and computationally easy.
Monte Carlo and bootstrap methods are two representative cases, see e.g. Freedman et al. [3] and Barbe et al. [4]. Although uniform subsampling provides a solution for intensive computing, it cannot detect important observations and is not effective in extracting information. We consider non-uniform sampling to overcome these drawbacks. Mathematicians, computer scientists and statisticians have already made important progress in this area. Drineas et al. [5,6] constructed fast Monte Carlo algorithms to approximate matrix multiplication and developed randomized algorithms for faster least squares approximation. Drineas et al. [7] presented a sampling algorithm for the least squares fit problem and studied its algorithmic properties. A key feature of the above algorithms is non-uniform sampling. Drineas et al. [8] also proposed a non-uniform distribution to develop fast algorithms that approximate the product of two matrices; the idea is to minimize the expected squared Frobenius distance between the product and its approximation. Ma and Sun [9] used the leverage scores as non-uniform importance sampling distributions for big data linear regression. Zhu et al. [10] obtained optimal subsampling distributions for large sample linear regression and proposed the OPT and PL subsampling methods for the linear regression model. They chose the sampling probabilities by minimizing the trace of a component of the variance-covariance matrix of the subsampling estimator, derived asymptotic normality, and performed simulations and real data analysis. Wang et al. [11] proposed non-uniform subsampling probabilities that minimize the asymptotic mean squared error of the subsampling estimator in logistic regression, and established consistency and asymptotic normality of the estimator. Xu et al. [12] studied subsampled Newton methods with non-uniform sampling. Wang et al. [13] developed information-based subdata selection for large linear regression.
Peng and Tan [14,15] investigated A-optimal subsampling for big data linear regression and constructed fast algorithms. Avron et al. [16] used random-sampling and random-mixing techniques to describe a fast least squares solver for dense, highly overdetermined systems. Ma et al. [17] studied the leverage-score-based non-uniform subsampling method, which uses the estimate from a subsample taken randomly from the full sample to approximate the full sample ordinary least squares estimate; they proposed the BLEV, SLEV, and LEVUNW methods to perform the subsampling. This body of work is mainly in the areas of linear regression and logistic regression. In this paper, we focus on smartphone-collected bike sharing data and extend the subsampling method beyond linear regression and logistic regression. We summarize our main contributions as follows:
- We generalize the subsampling method to a more complex data structure, extend it to analyze smartphone-collected bike sharing data, and demonstrate its feasibility for other count data settings.
- We develop non-uniform sampling distributions by minimizing the trace of certain variance-covariance matrices, which yields an estimator with smaller variance. We show the consistency and asymptotic normality of our estimator and pursue statistical inference under our non-uniform distributions.
- We introduce a two stage algorithm to reduce the computing time, conduct extensive computer experiments to illustrate our results, and examine the algorithmic properties.
The rest of the paper is organized as follows. In Section 2, we introduce the smartphone data collection and integrate the subsampling method into the framework of count data analysis. We also apply the subsampling method to the GLM and derive the corresponding optimal distributions. In Section 3, we introduce the two stage algorithm. In Section 4, we report our computer experiments: we compare the proposed sampling methods with uniform sampling, compare the running time of our methods with that of the full data analysis, and present the performance evaluation results. Finally, we conclude our work in Section 5.

Smartphone Data Collection System
Bike sharing systems allow people to rent a bike at one of the rental stations scattered around the city and use it for a short trip. A rental station includes one terminal and several bike stands. To use a bike, people scan the QR code on the bike with their smartphone; the smartphone app records the real-time transaction and communicates with the central control facility. If the transaction is approved, the central control facility sends a signal to open the lock, and the user can then ride the bike. After finishing the trip, the user locks the bike, the bike sends a signal to the central control facility, and the central control facility records the location of the available bike. Operators of bike sharing systems make this information available to users online, so that other people can find out where bikes are available. An important factor in the success of a bike sharing system is its ability to meet the demand for bikes at each station. Our goal is to predict the number of bikes delivered at each station. The number of shared bikes delivered is closely related to several key features, and the information on these key features is obtained from the smartphone. When the user begins to use the smartphone app to rent a bike, the smartphone starts to collect data, recording them at time points separated by preset intervals. When the user returns the bike, we take the averaged values as our variable values. The temperature, humidity and some weather-related variables are natural choices for the predictor variables.

Optimal Subsampling Method
The dependent variable of the smartphone bike sharing data is the number of delivered bikes. Because it is a nonnegative integer, we refer to this kind of data as count data. In a count data analysis model, the mean of a count response $y_i$ given the covariate vector $x_i$ satisfies
$$E(y_i \mid x_i) = h(x_i^T \beta),$$
where $\beta \in R^p$ is a regression parameter and $h$ is an inverse link function. Typically, $h(t) = \exp(t)$ (the inverse log link). The estimate $\hat{\beta}$ is obtained by maximizing the likelihood function, which is equivalent to solving the corresponding score equation. For an ordinary sample size, we apply Newton's method to obtain the numerical solution iteratively: with $\beta_k$ the parameter estimate at the $k$th iteration, the process stops when $\|\beta_{k+1} - \beta_k\| < \epsilon$, where $\epsilon$ is a preset tolerance that controls convergence.
We then consider the case where the sample size $n$ is very large and the estimate $\hat{\beta}$ is not available. In this situation, we take a subsample as a surrogate for the whole sample and use the subsample estimate to approximate the full sample estimate. Let $\pi_i$, $i = 1, \ldots, n$, be a sampling distribution on the $n$ data points. A special case is $\pi_i = 1/n$, the uniform distribution. We use the sampling distribution to take a subsample with replacement. The subsample is denoted by $(x_i^{*T}, y_i^*)$, $i = 1, \ldots, r$, with subsample size $r < n$, and $\pi_i^*$, $i = 1, \ldots, r$, are the corresponding sampling probabilities. We approximate the maximum likelihood estimate $\hat{\beta}$ by the subsample estimate $\hat{\beta}_r$, which solves the subsample analogue of the score equation. Similarly, $\hat{\beta}_r$ can be calculated iteratively by Newton's method, as in (5).
Under certain conditions, the subsample estimate $\hat{\beta}_r$ is consistent for the full sample estimate $\hat{\beta}$; moreover, as $n \to \infty$ and $r \to \infty$, $\hat{\beta}_r$ is asymptotically normal. The subsample estimator $\hat{\beta}_r$ thus has two good properties: consistency shows that $\hat{\beta}_r$ is asymptotically unbiased, and asymptotic normality makes statistical inference based on the subsample estimator possible.
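As a minimal sketch of the estimation steps described above, the following Python code fits a count model by Newton's method and computes a subsample estimate. It assumes the log link $h(t) = \exp(t)$ (Poisson regression) and inverse-probability weights $1/\pi_i^*$ in the subsample estimating equation; the function names `poisson_newton` and `subsample_estimate` and the simulated data are hypothetical and serve only as illustration.

```python
import numpy as np


def poisson_newton(X, y, weights=None, tol=1e-8, max_iter=100):
    """Solve sum_i w_i * (y_i - exp(x_i' beta)) * x_i = 0 by Newton's method.

    Assumes the log link h(t) = exp(t); returns (beta, iterations used).
    """
    n, p = X.shape
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    beta = np.zeros(p)
    for iteration in range(1, max_iter + 1):
        mu = np.exp(X @ beta)                  # fitted means h(x_i' beta)
        score = X.T @ (w * (y - mu))           # weighted score vector
        info = X.T @ (X * (w * mu)[:, None])   # weighted information matrix
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.linalg.norm(step) < tol:         # ||beta_{k+1} - beta_k|| < tol
            return beta, iteration
    return beta, max_iter


def subsample_estimate(X, y, probs, r, rng):
    """Draw r points with replacement using probs, then fit with weights 1/pi."""
    idx = rng.choice(len(y), size=r, replace=True, p=probs)
    return poisson_newton(X[idx], y[idx], weights=1.0 / probs[idx])


# Simulated count data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=(n, 2))])
beta_true = np.array([0.5, 0.3, -0.2])
y = rng.poisson(np.exp(X @ beta_true))

beta_full, n_iter = poisson_newton(X, y)
uniform_probs = np.full(n, 1.0 / n)            # the special case pi_i = 1/n
beta_sub, _ = subsample_estimate(X, y, uniform_probs, r=1000, rng=rng)
```

With uniform probabilities the weights are constant and cancel in the Newton step, so the subsample fit reduces to an ordinary fit on the drawn points; the weights matter once the sampling probabilities are non-uniform.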
The uniform sampling distribution is not a good choice, so the next step is to find optimal distributions. The purpose is to find distributions that minimize the variance matrices $V$ and $V_m$. We use the trace criterion to minimize a matrix. By minimizing the traces of $V$ and $V_m$, we obtain several types of sampling distributions, given in (7)-(12).

Two Stage Algorithm
From (7)-(12), the optimal sampling distributions all depend on the full sample estimator $\hat{\beta}$. In the framework of big data analysis, the full sample estimate $\hat{\beta}$ is usually not available, so we propose a two stage algorithm to approximate the optimal sampling distributions.
In stage one, we take a uniform subsample of size $r_0$ to get a pilot estimate of $\hat{\beta}$, which is used to approximate the optimal sampling distributions. We then use these approximated optimal sampling distributions to draw the second stage subsample and calculate our subsample estimate $\hat{\beta}_r$ based on the second stage subsample.
-Stage 2 Take a subsample of size $r$ with replacement using the sampling probabilities obtained in Stage 1. Calculate the subsample estimate $\hat{\beta}_r$ according to (5).
The two stage algorithm reduces the computation time, since in total we use only $r_0 + r$ data points. Our calculation process involves iteration, so taking only a portion of the whole data set greatly reduces the computation time.
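The two stages above can be sketched in Python as follows. This is an illustration under stated assumptions, not the exact procedure of the paper: it assumes the log link, and because formulas (7)-(12) are not restated in the sketch, it uses an assumed stand-in probability proportional to $|y_i - \exp(x_i^T \tilde{\beta})| \, \|x_i\|$, mixed with the uniform distribution for numerical stability. The names `weighted_poisson_fit`, `two_stage_estimate` and the parameter `mix` are hypothetical.

```python
import numpy as np


def weighted_poisson_fit(X, y, w, tol=1e-8, max_iter=100):
    """Weighted Newton iteration for a log-link count model (assumed form)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = np.exp(X @ beta)
        info = X.T @ (X * (w * mu)[:, None])
        step = np.linalg.solve(info, X.T @ (w * (y - mu)))
        beta = beta + step
        if np.linalg.norm(step) < tol:
            break
    return beta


def two_stage_estimate(X, y, r0, r, rng, mix=0.1):
    """Two stage subsampling with an ASSUMED stand-in for the optimal probabilities."""
    n = len(y)
    # Stage 1: uniform pilot subsample of size r0 gives a pilot estimate.
    pilot_idx = rng.choice(n, size=r0, replace=True)
    beta_pilot = weighted_poisson_fit(X[pilot_idx], y[pilot_idx], np.ones(r0))
    # Assumed probabilities: residual size times covariate norm, mixed with
    # the uniform distribution so that no weight 1/pi_i becomes extreme.
    size = np.abs(y - np.exp(X @ beta_pilot)) * np.linalg.norm(X, axis=1)
    probs = (1.0 - mix) * size / size.sum() + mix / n
    # Stage 2: draw r points with replacement and fit with weights 1/pi_i.
    idx = rng.choice(n, size=r, replace=True, p=probs)
    beta_r = weighted_poisson_fit(X[idx], y[idx], 1.0 / probs[idx])
    return beta_r, probs


# Simulated data (hypothetical, for illustration only).
rng = np.random.default_rng(1)
n = 20000
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=(n, 2))])
beta_true = np.array([0.5, 0.3, -0.2])
y = rng.poisson(np.exp(X @ beta_true))

beta_r, probs = two_stage_estimate(X, y, r0=200, r=1000, rng=rng)
```

In this sketch the Newton iterations run only on the $r_0$ and $r$ sampled points, while approximating the probabilities takes a single pass over the full data.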

Smartphone Bike Sharing Data Results
This data set is available from the UCI Machine Learning Repository website. Bike sharing systems can be considered a new generation of old-fashioned bike rentals. Use of the system is not restricted to renting and returning at the same docking station; bikes can be returned to any docking station after use. Predicting the hourly bike demand helps in planning, expanding and maintaining an adequate number of bikes. In the United States, bike sharing systems have proved very successful in major cities, including Washington, DC, New York, Chicago and Los Angeles, where bike sharing has become a popular transportation option. Our goal in this example is to build statistical models to predict the hourly demand for bikes in the Washington, DC area.
There are 17,389 observations in total in this data set, and the response variable is the number of casual bike rentals. The predictor variables are season, workingday, daytime, weathersit, temp, hum and windspeed. All these features are available from the users' smartphones. The season variables include a spring indicator, a summer indicator and a fall indicator; winter is the reference level. The variable workingday indicates whether a day is a working day; the reference level is a weekend or holiday. The variable daytime indicates whether the time is between 7:00 and 22:00, with the time range from 0:00 to 5:00 serving as the reference level (night time). The season, workingday and daytime variables can be obtained directly from the users' smartphones. The weathersit variable has 4 categories: the first represents clear, few clouds, and partly cloudy; the second represents mist plus cloudy, mist plus broken clouds, and mist; the third represents light snow and light rain; and the fourth represents heavy rain, and snow plus fog. Since the fourth category has only three observations, we combine the third and fourth categories. We choose the first category as the reference. Temp is the normalized temperature in Celsius. Hum is the normalized humidity. Windspeed is the normalized wind speed. The weathersit, temp, hum and windspeed data are collected in this way. There are 11 regression coefficients for the predictor variables, including the intercept.
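To make the count of 11 coefficients concrete, the following Python sketch builds a design matrix with the dummy coding described above. It is an illustration only: the function name `build_design_matrix`, the season codes (1 = spring, 2 = summer, 3 = fall, 4 = winter), and the rule that every hour outside 7 to 22 falls into the reference level are assumptions, not details confirmed by the text.

```python
import numpy as np


def build_design_matrix(season, workingday, hour, weathersit, temp, hum, windspeed):
    """Return an n x 11 design matrix: intercept plus 10 predictor columns.

    Assumed codes: season 1 = spring, 2 = summer, 3 = fall, 4 = winter (reference).
    """
    season = np.asarray(season)
    workingday = np.asarray(workingday)
    hour = np.asarray(hour)
    # Merge the rare fourth weather category into the third.
    weather = np.minimum(np.asarray(weathersit), 3)
    columns = [
        np.ones(len(season)),        # intercept
        season == 1,                 # spring indicator
        season == 2,                 # summer indicator
        season == 3,                 # fall indicator
        workingday == 1,             # working day indicator
        (hour >= 7) & (hour <= 22),  # daytime indicator (assumed rule)
        weather == 2,                # mist category
        weather == 3,                # light or heavy precipitation (merged)
        np.asarray(temp, dtype=float),
        np.asarray(hum, dtype=float),
        np.asarray(windspeed, dtype=float),
    ]
    return np.column_stack(columns).astype(float)


# Two hypothetical records: a winter night with heavy rain, a summer workday morning.
X_demo = build_design_matrix(
    season=[4, 2], workingday=[0, 1], hour=[3, 8], weathersit=[4, 1],
    temp=[0.2, 0.6], hum=[0.5, 0.4], windspeed=[0.1, 0.3],
)
```

Each categorical variable contributes one column per non-reference level, which gives 3 + 1 + 1 + 2 = 7 indicator columns, plus 3 continuous columns and the intercept.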
The computer experiments are carried out in the following way. We take a uniform subsample of size $r_0 = 200$ to get an initial estimate $\hat{\beta}_{r_0}$, plug it into formulas (7)-(12) to approximate the optimal subsampling distributions, and take a subsample of size $r$ according to the approximate distributions to get $\hat{\beta}_r$. For each subsample size $r$, we calculate the empirical mean squared error (MSE) over $B = 1000$ subsamples, where $\hat{\beta}_{r,b}$ is the estimate from the $b$th subsample with subsample size $r$.
Fig. 1 Percentages of the 95% confidence intervals which caught the full sample MLE $\hat{\beta}_2$
Figure 1 shows that the coverage probabilities of the proposed methods achieve the nominal 95% for both small and large subsample sizes.
Table 2 The length ratios of the 95% confidence intervals of proposed subsampling to uniform subsampling, subsample size r = 400
Table 2 shows the confidence interval length ratios of the proposed methods versus the uniform method. First, all of the values in Table 2 are less than 1, indicating that the 95% confidence intervals constructed by our proposed methods are shorter than those from uniform subsampling. Second, as Figure 1 shows, the coverage probabilities of the proposed methods achieve the nominal 95% for both small and large subsample sizes.
That is, our methods are more accurate while maintaining the nominal coverage probability.
Figure 2 shows the MSE plots. Across the subsample sizes $r$, the largest MSEs are given by the uniform method and the smallest are given by sampling with a $k = 2$ distribution. All the proposed methods lie below the uniform method line, which means that our methods have smaller MSE.
Table 3 contains the MSE ratios of the proposed sampling methods to uniform sampling for different subsample sizes $r$. In Table 3, all the values are less than 1, indicating that our proposed sampling distributions produce smaller MSEs than uniform sampling. The highest reduction is about 65%. Of the two families of proposed distributions, indexed by $k = 0, 1, 2$, one produces smaller MSEs than the other, and the $k = 2$ distribution is the best within each family.
In order to evaluate the computational efficiency, we report the running times for computing $\hat{\beta}_r$ using the $k = 2$ distributions. The experiment is carried out in the R programming language, on a desktop with an Intel i5 processor and 8GB of memory. We recorded the CPU times for 1000 repetitions and averaged them to make the comparison fair. The results are in Table 4. We observe that one of the two $k = 2$ methods requires more time than the other. All the proposed methods need significantly less computing time than the full data analysis. In Table 5, we can see that all the proposed methods have a similar number of iterations, indicating that smaller subsample sizes do not necessarily increase the number of iterations for Newton's method.
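The evaluation measures used in these experiments can be sketched as follows. The sketch assumes that the empirical MSE takes the usual form $B^{-1} \sum_{b=1}^{B} \|\hat{\beta}_{r,b} - \hat{\beta}\|^2$, measured against the full sample MLE, and that coverage is the fraction of intervals containing the target value; the function names and the small numerical inputs are hypothetical.

```python
import numpy as np


def empirical_mse(estimates, beta_full):
    """Average squared Euclidean distance of B subsample estimates from beta_full.

    estimates has shape (B, p); this form of the MSE is an assumption.
    """
    diffs = np.asarray(estimates, dtype=float) - np.asarray(beta_full, dtype=float)
    return float(np.mean(np.sum(diffs ** 2, axis=1)))


def coverage(lower, upper, target):
    """Fraction of intervals [lower_b, upper_b] that contain the target value."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    return float(np.mean((lower <= target) & (target <= upper)))


# Tiny hypothetical inputs, chosen so the results can be checked by hand.
mse_demo = empirical_mse([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])      # (1 + 13) / 2
cover_demo = coverage([0.0, 1.0, 2.0], [2.0, 3.0, 4.0], target=1.5)  # 2 of 3 intervals
```

An MSE ratio such as those reported in Table 3 would then be the value of `empirical_mse` for a proposed method divided by its value for uniform sampling at the same subsample size.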

Conclusions
In this paper, we introduced the subsampling method, proposed an optimal subsampling algorithm, and established consistency and asymptotic normality results. We applied the optimal subsampling algorithm to smartphone-collected bike sharing data. Experimental results show that our proposed methods perform better than the uniform sampling method in terms of smaller variance, and require less computation time than the full data analysis. There are important issues that we will investigate in the future.