Big Data in Transportation: Using R to fit the distribution of vehicle count for a signalized link

Still remeber this post?
"Use Python to plot the vehicle count between two consecutive intersections" at:
http://transportationbigdata.blogspot.com/2015/05/using-python-to-plot-vehicle-count.html

Once we've got these plots, we ask: is this sample following certain distribution? Or equivelently, the question is: how to fit the observation with certain distribution?

Maybe Python could be used to do this work. However, I decided to try R out for this job.

=====================================================================
Two alternative packages you should consider for this are library(MASS) and library(fitdistrplus). I tested MASS.

Here I have a sample set: eSample

Code below fits different distributions to this sample set with return:

e.norm<-fitdistr(eSample, "normal")</pre>

This code will determine the parameters for the specified distribution according to the sample data set. For example, when we specify a "normal" distribution, the fitdistr() will return mean and variance.

To evaluate the goodness of fit by visual, method of "QQPlot" is recommended.

QQplot requires two set of data, X as the random data from theoretical distribtuion, and Y as the random data from sample. For example, I need to check how good the fitdistr() result is, I generate a bunch of random data (x.norm) following the specified distribution (i.e., normal for this example). And then put x.norm and eSample to QQPlot as:

x.norm<-rnorm(n=1860, m=e.norm$estimate[1], sd = e.norm$estimate[2])
qqplot(x.norm, eSample)
abline(0,1)

The more overlap between qqplot and the abline, the better the fit.
====================================================================

For the data I got from field, different built-in distributions are tested as:

## fit the distribution to normal, or other user specified distributions
e.norm<-fitdistr(eSample,"normal")
e.wei<-fitdistr(eSample,"weibull")
e.gamma<-fitdistr(eSample,"gamma")
e.pois<-fitdistr(eSample,"poisson")
e.negbin<-fitdistr(eSample,"negative binomial")
e.lnorm<-fitdistr(eSample,"lognormal")

## produce samples from theoretical distributions
x.norm<-rnorm(n=1860, m=e.norm$estimate[1], sd = e.norm$estimate[2])
x.wei<-rweibull(n=1860, shape=e.wei$estimate[1], scale = e.wei$estimate[2])
x.gamma<-rgamma(n=1860, shape=e.gamma$estimate[1], rate = e.gamma$estimate[2])
x.pois<-rpois(n=1860, lambda = e.pois$estimate)
x.negbin<-rnegbin(n=1860, mu=40, theta = 5)
x.lnorm<-rlnorm(n=1860, meanlog=e.lnorm$estimate[1], sdlog=e.lnorm$estimate[2])

# plot subplots for each qqplot result
attach(mtcars)
par(mfrow=c(3,2))
qqplot(x.norm, eSample)
abline(0,1)

qqplot(x.wei, eSample)
abline(0,1)

qqplot(x.gamma, eSample)
abline(0,1)

qqplot(x.pois, eSample)
abline(0,1)

# it seems produce the best fit
qqplot(x.negbin, eSample)
abline(0,1)

qqplot(x.lnorm, eSample)
abline(0,1)

And the QQplot result could be visualized as:

As readers may notice, "negaive binomial" distribution best fit the sample data.

Big Data in Transportation

Labels

Friday, May 29, 2015

Using R to fit the distribution of vehicle count for a signalized link

1 comment: