
Friday, May 29, 2015

Using R to fit the distribution of vehicle count for a signalized link

Still remember this post?
"Use Python to plot the vehicle count between two consecutive intersections" at:
http://transportationbigdata.blogspot.com/2015/05/using-python-to-plot-vehicle-count.html

Once we've got these plots, we ask: does this sample follow a certain distribution? Or, equivalently: how do we fit a given distribution to the observations?

Maybe Python could be used to do this work. However, I decided to try R out for this job.

=====================================================================
Two alternative packages you should consider for this are library(MASS) and library(fitdistrplus). I tested MASS.

Here I have a sample set: eSample

The code below fits a distribution to this sample set and stores the result:

library(MASS)
e.norm<-fitdistr(eSample, "normal")

This code determines the parameters of the specified distribution from the sample data set. For example, when we specify a "normal" distribution, fitdistr() returns the estimated mean and standard deviation in e.norm$estimate.

To evaluate the goodness of fit visually, the Q-Q plot method is recommended.

qqplot() requires two sets of data: X, random data drawn from the theoretical distribution, and Y, the sample data. For example, to check how good the fitdistr() result is, I generate a set of random data (x.norm) following the specified distribution (normal in this example), then pass x.norm and eSample to qqplot() as:


x.norm<-rnorm(n=1860, mean=e.norm$estimate[1], sd = e.norm$estimate[2])
qqplot(x.norm, eSample)
abline(0,1)

The closer the points of the qqplot lie to the abline (the line y = x), the better the fit.
====================================================================
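Since Python was mentioned as an alternative, here is a minimal standard-library sketch of the same idea. It is not the MASS code above: instead of drawing random numbers from the fitted distribution, it pairs each sorted observation with the matching theoretical quantile, which is a common variant of the Q-Q comparison. The data set here is a hypothetical stand-in for eSample, and the function names are mine.

```python
import random
from statistics import NormalDist, fmean, pstdev


def fit_normal(sample):
    """Maximum-likelihood normal fit: sample mean and population standard deviation."""
    return fmean(sample), pstdev(sample)


def qq_points(sample, mu, sigma):
    """Pair each sorted observation with the matching quantile of N(mu, sigma)."""
    dist = NormalDist(mu, sigma)
    n = len(sample)
    # Plotting positions (i + 0.5) / n stay strictly inside (0, 1).
    theoretical = [dist.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(sample)))


# Hypothetical stand-in for eSample: 1860 draws from a normal distribution.
random.seed(1)
sample = [random.gauss(40, 8) for _ in range(1860)]

mu, sigma = fit_normal(sample)
points = qq_points(sample, mu, sigma)

# Largest vertical distance from the y = x line, in the units of the data.
max_gap = max(abs(observed - expected) for expected, observed in points)
print(round(mu, 2), round(sigma, 2), round(max_gap, 2))
```

Plotting the pairs in points against the line y = x gives the same kind of picture as qqplot() followed by abline(0,1).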

For the data I collected in the field, several built-in distributions were tested as follows:

## fit the distribution to normal, or other user specified distributions
e.norm<-fitdistr(eSample,"normal")
e.wei<-fitdistr(eSample,"weibull")
e.gamma<-fitdistr(eSample,"gamma")
e.pois<-fitdistr(eSample,"poisson")
e.negbin<-fitdistr(eSample,"negative binomial")
e.lnorm<-fitdistr(eSample,"lognormal")

## produce samples from theoretical distributions
x.norm<-rnorm(n=1860, mean=e.norm$estimate[1], sd = e.norm$estimate[2])
x.wei<-rweibull(n=1860, shape=e.wei$estimate[1], scale = e.wei$estimate[2])
x.gamma<-rgamma(n=1860, shape=e.gamma$estimate[1], rate = e.gamma$estimate[2])
x.pois<-rpois(n=1860, lambda = e.pois$estimate)
# note: mu and theta are hard-coded here; the fitted values are stored in e.negbin$estimate
x.negbin<-rnegbin(n=1860, mu=40, theta = 5)
x.lnorm<-rlnorm(n=1860, meanlog=e.lnorm$estimate[1], sdlog=e.lnorm$estimate[2])

# plot each qqplot result as a subplot in a 3-by-2 grid
par(mfrow=c(3,2))
qqplot(x.norm, eSample)
abline(0,1)

qqplot(x.wei, eSample)
abline(0,1)

qqplot(x.gamma, eSample)
abline(0,1)

qqplot(x.pois, eSample)
abline(0,1)

# this one seems to produce the best fit
qqplot(x.negbin, eSample)
abline(0,1)

qqplot(x.lnorm, eSample)
abline(0,1)

The QQplot results can be visualized as:

As readers may notice, the "negative binomial" distribution fits the sample data best.
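One way to see why a negative binomial can beat a Poisson on count data is to compare the sample variance with the sample mean. A Poisson distribution has variance equal to its mean, while the negative binomial in the MASS parametrisation has variance mu + mu^2/theta. With the hard-coded mu = 40 and theta = 5 used above, that is 40 + 1600/5 = 360, nine times the Poisson variance. The sketch below uses made-up counts, not the field data, and a method-of-moments estimate of theta, which is a simpler technique than the maximum-likelihood fit that fitdistr() performs.

```python
from statistics import fmean, pvariance


def dispersion_summary(counts):
    """Return mean, variance, variance/mean ratio and a moment estimate of theta."""
    mu = fmean(counts)
    var = pvariance(counts)
    ratio = var / mu
    # Negative binomial (MASS parametrisation): variance = mu + mu**2 / theta.
    # Solving for theta only makes sense when the data are overdispersed.
    theta = mu ** 2 / (var - mu) if var > mu else float("inf")
    return mu, var, ratio, theta


# Hypothetical counts, not the field data from the post.
counts = [12, 25, 31, 38, 40, 44, 47, 52, 61, 90]
mu, var, ratio, theta = dispersion_summary(counts)
print(mu, var, round(ratio, 2), round(theta, 2))
```

A ratio well above 1 suggests overdispersion, in which case a Poisson fit is likely to look poor on the Q-Q plot.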

Monday, May 25, 2015

Set up the Python environment for your research project

To set up the environment for my research project, I need several tools:

C# (comes with Visual Studio)


MS SQL server management studio


MatLab


Python


R.


=====================================================


C# comes with Visual Studio. One can now get a free version of VS from Microsoft, called the "Community" edition. It is compact and simplified but should meet most of your research needs, unless you are running a huge project and cooperating with a lot of people. If that is the case, you should probably have VS purchased by your department/institute.


MS SQL server also has a free version called "Express".


For MatLab, you need to pay for it or ask your department/institute to purchase and install it for you.


=====================================================


Let's talk about Python.


Python has evolved to 3.4, but tons of applications are still on 2.7. For me, 3.4 is better because I always want to pursue the latest version. However, if you know that your project requires high compatibility, go with 2.7.


First step, download and install Python 3.4 or 2.7.


Second step, go and get an IDE for your convenience.

Some people claim they use Notepad (from MS) or the open-source Notepad++ for writing Python. I guess they are real Python experts. If you are reading this post, I bet you are as new to it as I am, so please find yourself a comfortable IDE. I personally use PyCharm from JetBrains (google it). Others recommend IDEs such as Canopy and Spyder. Go and give them a try.

Third step, find and install the packages you need.

I guess the core packages I need would include Numpy and Scipy for scientific calculation, Matplotlib for professional plotting, pymssql for MSSQL database connection, and pip for package management. The following table shows how and where you can get them:


Package name | How to get it | Note and resources
Numpy and Scipy | download exe for win and install | download link: http://www.scipy.org/scipylib/download.html
Matplotlib | download exe for win and install | download link: http://matplotlib.org/downloads.html
pip | upgrade through PyCharm package manager | To access the package manager: File->Settings->Project: Python->Project Interpreter->choose yours, for example 3.4.2. Then you will find some packages installed with your Python. To add one, click the plus on right. To upgrade one, click it and then click the up arrow on right
pymssql | installed through PyCharm package manager |

I have other posts regarding the installation of pymssql and numpy. If interested, you can check them out.

Up to now, you should have enough resources for your work. Have Fun!

=====================================================

Troubleshooting

The main problem I ran into during the process was with pip.

As some online resources suggested, I tried to install pymssql through the pip command (that is: Start->cmd->type "pip install ****"). Every time I did so, I got an error reading "'pip' is not recognized as an internal or external command".

The reason for this error is a missing entry in the "PATH" item of your system's "Environment variables". The "pip" command is located in your Python installation folder, which has not been added to PATH. Check it by a) typing "echo %PATH%" in your cmd window, or b) right-clicking "Computer"->Properties->Advanced system settings->Environment Variables. You may see a long PATH variable.

The solution is to change the PATH variable by adding "C:\Python34\Scripts\" at the end of the PATH value. Make sure you put a semicolon before it.
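To confirm that the change took effect, something like the following sketch can be run from Python itself. The helper name on_path is my own, and the folder is the install location mentioned above; adjust it to your installation. Open a new cmd window first, since an already open one keeps the old PATH.

```python
import os
import shutil


def on_path(directory, path_value, sep=os.pathsep):
    """Return True if directory is one of the entries in a PATH-style string."""
    def norm(p):
        # Ignore surrounding spaces, quotes, case (on Windows) and trailing slashes.
        return os.path.normcase(p.strip().strip('"')).rstrip("\\/")

    wanted = norm(directory)
    return any(norm(entry) == wanted
               for entry in path_value.split(sep) if entry.strip())


# Install location from the post; adjust it to your own Python folder.
scripts_dir = r"C:\Python34\Scripts"
print("Scripts folder on PATH:", on_path(scripts_dir, os.environ.get("PATH", "")))

# shutil.which returns the full path of the command, or None if it is not found.
print("pip found at:", shutil.which("pip"))
```

If pip is still not found, running "python -m pip install ****" is another way to reach pip without relying on PATH for the pip command itself.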

Another thing I noticed was that some Canopy paths were in PATH. I had installed "Enthought Canopy" before PyCharm, then found that I preferred PyCharm and uninstalled Canopy. But its paths remained in the environment variable, and I had to delete them manually.

If you run into any other trouble when setting up your lovely environment (you are going to work with it for a long time, so it deserves to be lovely), please leave a comment and let's discuss!

====================================================

Friday, May 15, 2015

Real time travel time estimation from big data

Millions of cars on road generate tons of data every day. Vehicle communication technology (including drivers' smart phones connected with 3G/4G/WiFi) provides us a new way of accessing this rich dataset.

The University of Michigan Transportation Research Institute (UMTRI) hosts a connected vehicle research project called SafetyPilot, originally intended to improve traffic safety by utilizing vehicle communication technology. Each participating connected vehicle carries message-broadcasting equipment as well as a GPS sensor. Some more advanced vehicles may also be equipped with a message-receiving device and/or an onboard warning system. The Basic Safety Message (BSM) is the essential information transmitted by these connected vehicles and some Road Side Equipment (RSE). A BSM contains two parts. The first part is mandatory and includes the vehicle's GPS location, speed, acceleration/deceleration rates, pedal status, light status, heading direction and much more. The second part is optional environmental information, for example weather conditions or bus schedules. These BSMs are transmitted via Dedicated Short Range Communication (DSRC) at a frequency of 10 Hz (that's really a lot of data).

Every 3-6 months, these vehicles come back to UMTRI to upload their data. Since the project began in 2012, UMTRI has collected more than 70 billion BSM records from more than 4 million trips, all from about 3000 connected vehicles (3~4% of the total car ownership).
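As a rough sanity check on these figures, the arithmetic below relates record count, trip count and the 10 Hz rate. It is only a back-of-the-envelope sketch: it uses the rounded numbers quoted above ("more than", "about") and assumes every record is a BSM logged at a steady 10 Hz during a trip, which the real data may not satisfy.

```python
# Rounded figures quoted in the post; the true values are somewhat larger.
records = 70_000_000_000   # BSM records
trips = 4_000_000          # trips
vehicles = 3_000           # connected vehicles
rate_hz = 10               # BSM broadcast frequency

records_per_trip = records / trips                    # 17500.0
minutes_per_trip = records_per_trip / rate_hz / 60    # about 29 minutes
trips_per_vehicle = trips / vehicles                  # about 1333 trips

print(records_per_trip, round(minutes_per_trip, 1), round(trips_per_vehicle))
```

Under those assumptions, an average trip lasts roughly half an hour, which is plausible for urban driving.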

Diving into this database, we can locate the records we need to produce a road travel time estimation for the City of Ann Arbor. Given the data available, we present the visualized map for this city from 07:00 to 10:00 on Dec 2, 2013. The same data processing method could be replicated to estimate road travel times for any other time window, and even in real time.

The tool we used to produce the map is ArcGIS 10.2. We produced one map for every 10 minutes. Each map, as a frame in the video, is shown for 3 seconds. For your convenience, static maps (click for larger view) are also included at the end of the post.


Tuesday, May 12, 2015

Use Python to plot the vehicle count between two consecutive intersections

# this is part of a research project in which we are estimating the real time vehicle count on a road segment
# this preliminary result is produced by processing vehicle count data from loop detectors embedded under two consecutive intersections
# we studied three consecutive links (i.e., in total four sets of detectors included)
# the time frame for the data is 0700~0800, Dec 01~31, 2008
# the vehicle count(T) = vehicle count(T-1) + upstream detector count - downstream detector count [1]

Figure 1: vehicle counts on each link during weekdays

Figure 2: vehicle counts on each link during weekends
The vehicle counts, as one may notice, are sometimes negative.

This is because, as in [1], we don't have data on the vehicle count(T-1) at the start of the period. It is assumed to be 0.

The consequence of this assumption is that the plotted figures (which could also be taken as the distribution) are shifted relative to the real data. One can add a base value to the counts to reconstruct the real situation, as long as the base is well estimated or measured.
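The recursion in [1] and the effect of the unknown starting count can be sketched in a few lines of Python. The detector counts below are made up for illustration, and the function names are mine. The "minimum base" shown is only the smallest shift that removes negative counts; it is a lower bound on the true initial count, not an estimate of it.

```python
from itertools import accumulate


def running_count(upstream, downstream, initial=0):
    """count(T) = count(T-1) + upstream(T) - downstream(T), starting from `initial`."""
    net = [u - d for u, d in zip(upstream, downstream)]
    # accumulate yields the initial value first; drop it to keep one value per interval.
    return list(accumulate(net, initial=initial))[1:]


def minimum_base(counts):
    """Smallest shift that makes every count non-negative."""
    return max(0, -min(counts))


# Hypothetical detector counts per interval, not the 2008 field data.
upstream = [3, 1, 0, 4]
downstream = [1, 4, 2, 1]

counts = running_count(upstream, downstream)       # [2, -1, -3, 0]
base = minimum_base(counts)                        # 3
shifted = running_count(upstream, downstream, initial=base)
print(counts, base, shifted)
```

Starting from the base instead of 0 shifts every value by the same amount, so the shape of the distribution is unchanged.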

Friday, May 8, 2015

Install numpy

numpy is now an essential part of Scipy. It provides extended data types beyond Python's basic data types. Personally, I want to import numpy for its matrix support.

Here I will explain very briefly how to install numpy on Windows.

Be aware that numpy is different from pymssql, which can be unzipped into the site-packages folder directly. If you install numpy the way I described for pymssql and then "import numpy" directly, you will get an error message saying that it doesn't contain a configuration file.

The reason is that the numpy package, written in C, needs compiling before use. To compile it you need a C compiler. I don't know the exact way to do it, however.

The way suggested here, which is also the way I adopted, is much more straightforward. There are always contributors who do those difficult tasks for us (I am grateful to them). For the latest version of numpy, check out this page and download an exe file. Remember to download the version that matches your Python version.
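After running the installer, a quick check like the sketch below shows which Python version is running and whether the packages can be found by it. The helper name is_installed is my own; importlib.util.find_spec is part of the standard library.

```python
import importlib.util
import sys


def is_installed(package):
    """Return True if `package` can be found by the current interpreter."""
    return importlib.util.find_spec(package) is not None


# The installer you download should match this major.minor version.
print("Python", sys.version_info.major, sys.version_info.minor)

for name in ("numpy", "scipy"):
    print(name, "installed" if is_installed(name) else "missing")
```

If a package shows as missing right after installation, the installer probably targeted a different Python installation than the one running the script.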


Connect python with SQL server

Problem description:

1. cannot install pymssql
2. cannot connect pymssql with my MSSQL


Below is what I did for the whole thing
Install pymssql:

1. Install the pymssql package. One may install it using 'pip install pymssql'. An alternative is to install the package from File->Settings->Python Interpreter->click '+'->search for pymssql->click install. Unfortunately, neither of these methods worked for me. Both of them created a folder named "pymssql-2.1.1.dist-info" without any .py files in it. I had to use the very basic way: download the .tar.gz file and unzip it to Python34->...->site-packages. This method works.

1.1 Debug 1: When I used the example code, I got an error. To solve it, do not use "from os import getenv".

1.2 Debug 2: Hard-code the connection string. Error: "...unknow reason". Google it. Someone had a similar problem years ago. The solution he/she proposed was to amend the freetds.conf file by appending the database configuration at the end of the .conf file. So go to Step 2.

2. Install freetds. I don't quite know what it is. I thought it should come with PyCharm or at least with pymssql. However, I could not find a freetds.conf file in my file explorer. Again, download the .tar.gz file and unzip it to the same location.

2.1 Amend the .conf file by appending four lines: "[myDBname]", "host = servername" (note: without the instance name; for example, use myDB instead of myDB\SQLExpress), "port = 1433", and "tds version = 8.0" (some people use 7.0; I used 8.0 and it works).

SQL Server:

1. Change the server authentication mode to SQL Server and Windows authentication

2. Manually set the port of SQLExpress to 1433 in the SQL management tool (the default value for my SQL Server was empty)

3. Add a user, set its server roles to be public and sysadmin (other roles exist, not sure if other roles will also work)

4. User mapping, map the user to the specific database you want to login
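When pymssql still cannot connect after these steps, it can help to check whether the port is reachable at all before looking at logins and user mapping. The following is a minimal standard-library sketch; the function name port_is_open is mine, and "localhost" is a placeholder to be replaced with your server name.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unknown host names.
        return False


# Replace "localhost" with your server name; 1433 is the port set in step 2.
print(port_is_open("localhost", 1433))
```

A False result points to the port setting, the SQL Server service, or a firewall rather than to the user account.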


Sample code:

import pymssql

server = "myDBenginename"
user = "userID"
password = "userPW"

# pass the variables defined above instead of repeating the literals
conn = pymssql.connect(server, user, password, "DBname")
cursor = conn.cursor()

cursor.execute("SELECT top 10 * FROM Tablename")
row = cursor.fetchone()

# assumes the first column is an integer ID and the second a name
while row:
    print("ID=%d, Name=%s" % (row[0], row[1]))
    row = cursor.fetchone()

# after INSERT/UPDATE/DELETE you must call conn.commit() to persist your data
# if you don't set autocommit to True; a SELECT needs no commit
conn.close()

Sample code 2 (freetds.conf):

[myDBname]
host = servername # no quotation marks, no instance name (e.g., \SQLExpress)
port = 1433 # the same as you defined in the SQL Server management tool
tds version = 8.0 # some people suggest 7.0; 8.0 works as well

My Python development environment

My Python environment:

OS: Win 7 64bit Eng
Python: Python 3.4
IDE: PyCharm Community Edition (free)

Pymssql and numpy (Scipy):

These are the two essential packages I need at the current stage.

Pymssql:
Pymssql is used to connect Python scripts to MSSQL server. One should notice that at some time in 2011, pymssql changed its strategy. Now pymssql connects to MSSQL only via SQL authentication (in MSSQL, that means you change your authentication mode to SQL and Windows, also known as mixed mode).

Numpy:
Numpy always appears together with scipy. numpy is a numeric package for Python; for example, it helps you define a matrix, which is not supported by Python by default. If you need more advanced scientific and mathematical functions, use scipy. Before you can install scipy, you must have numpy installed. For details, google numpy or scipy.
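To illustrate what "not supported by default" means: plain Python only has nested lists, so a matrix product has to be written by hand, as in the sketch below. The function matmul is my own illustration. With numpy installed, the same product is available as numpy.dot, and the last lines compare the two when numpy can be imported.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists (lists of rows)."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    # zip(*b) iterates over the columns of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


a = [[1, 2], [3, 4]]
b = [[0, 1], [1, 0]]
product = matmul(a, b)
print(product)  # [[2, 1], [4, 3]]

try:
    import numpy as np
except ImportError:
    np = None  # numpy is optional for this sketch

if np is not None:
    # numpy.dot performs the same matrix product on array objects.
    assert np.dot(np.array(a), np.array(b)).tolist() == product
```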

Problems encountered and solved:

Connect python with SQL server

Install numpy