August 2016

Sunday, August 21, 2016

Changepoint Detection : Theoretical Background

No comments : Posted by Unknown at 9:08:00 PM Labels: #GSOC , #postCP , #R-gsoc , #Tech

Introduction

Changepoints are abrupt variations in the generative parameters of sequential data. Changepoint detection is the process of identifying such abrupt variations and it has been useful in a wide range of applications such as EEG analysis, DNA segmentation, econometrics etc.

Bayesian ways of change point detection focus on segmentation and techniques to generate samples from the posterior distribution over changepoint locations.

Let's try to understand this without a lot of complex mathematical notations. When we are given a data set that varies according to time ( e.g: - a time series ), there might be sudden changes. Because of these changes the data set can be partitioned and separately analysed. Therefore if we assume that within a partition there is a relationship between data points, we can perform a regression to calculate regression coefficients. (Provided that we have an idea about the change points)

However this does not capture the uncertainty involved with the change points. This is where change point distributions come in to play. Following the Bayesian change point model, and by using a constrained HMM (Hidden Markov Model ) we are able to calculate the posterior distribution.

Therefore the posterior distribution can be calculated with a modified forward backward algorithm. The algorithm can be further improved to compute the most probable change point configuration.

The references below provide more insight to the postCP package.

[1] Luong, T.M., Rozenholc, Y. and Nuel, G., 2013. Fast estimation of posterior probabilities in change-point analysis through a constrained hidden Markov model. Computational Statistics & Data Analysis, 68, pp.129-140.

My Experience with GSOC and R

No comments : Posted by Unknown at 7:27:00 PM Labels: #GSOC , #postCP , #R-gsoc , #Tech

It all began when I started searching for a Google Summer of Code project last year (November, 2015) . While I was searching through the web found this page that suggested of a project idea. I didn't have a complete understanding about the problem but I contacted the mentors and familiarized myself with R and the theoretical background of the package. In March, I submitted a project proposal and got selected for GSOC finally. (Of course I had to submit a test to the mentors first)

The actual challenge began only then. My mentor, Professor Gregory Nuel sent me a paper he had presented which included a detailed explanation about the change point model and some code examples. I must say that it took a while to understand the document completely with the different notations used. I always kept bothering my mentors for further details whenever I had a question. A significant proportion of the complexity of the project was due to the theoretical background required to the project.

I started with simple steps. Model specific implementation required to choose a few models and finalize the calculations for the log evidence. I was able to finalize this with the help of my mentors. The core implementation of the forward backward algorithm was done in C++. I had several issues interfacing it with the R code using Rcpp. I was able to rectify them by adding useDynLib() in the NAMESPACE.

useDynLib(postCP)

After the core forward backward algorithms were interfaced, I could proceed with the core model part. Following the change point model, I coded to find the initial regression coefficients using standard R functions. The result was then used to calculate model specific log evidences which were input to the forward backward algorithm to compute posterior probability distributions using HMM. Finally the posterior change point probability distribution code be obtained.

Using the change point probability distribution, the parameters and the log evidence was updated. Finishing off the core implementation part of the package.

However at that time I found out that I hadn't followed a proper coding style guideline. Therefore I referred to "Hadley Wickam R style" and "R style. An Rchaeological Commentary". Well, all of this would be an utter waste if the user did not have any source to understand how to use the functions. Therefore I spent time on improving function documentation as well.

To provide an insight to the theoretical background behind the package, I added Vignettes which is a long form documentation in R packages.

Throughout the project I learned new things and I always tried to develop the postCP package in the most R compliant way. All the requirements have been met and the project has been successfully finished.

Here is the source folder which contains the improvements I made to the package.

https://drive.google.com/drive/folders/0B8xO3Cc0h6rIbXNDTUM1TDlMbDQ?usp=sharing

Here are the commits done throughout the project.

https://github.com/malithj/postCP_Improvement/commits/feature-glmsyntax?author=malithj
https://github.com/malithj/postCP_Improvement/commits/feature-model?author=malithj

The final package can be found in the link below.

https://github.com/malithj/postCP_Improvement

postCP change point detection with GSOC

No comments : Posted by Unknown at 11:34:00 AM Labels: #GSOC , #postCP , #R , #Tech

Introduction

The project aimed at improving the postCP package and making it available on CRAN again. The snag that prevented the package from being updated is the recent requirement that in the R code, .C() calls require DUP=TRUE arguments, and .Call() is suggested instead of .C(). The implementation of postCP package required that it's done in the most R compliant way. Separating out the model specific implementation and the core implementation. The core part is implemented in C++ for speed in calculations. The project page can be found here.

Implementation

To improve the usability and user friendliness, a glm syntax based "formula" and "family" specification was added. These commits are in the feature-glmsyntax branch. Also the model specific implementation was done by adding four models. (Gaussian, Poisson, Binomial and Gamma). The previous package only included three models. (Gaussian, Poisson and Binomial). These commits are found here.

https://github.com/malithj/postCP_Improvement/commits/feature-glmsyntax?author=malithj

The model specific part was integrated with the core specific part (C++ forward backward algorithm) to give the required results. Also following the change point model, a separate section was added for parameter calculations and parameter updates based on the updated log evidences. Standard error checks were included to handle incorrect user inputs as much as possible and to give meaningful error messages. Vignettes have been included to provide long form documentation. The commits are found here.

https://github.com/malithj/postCP_Improvement/commits/feature-model?author=malithj

The development has been completed as agreed in the project proposal. All the branches have been merged to the master branch and it reflects the latest package.

https://github.com/malithj/postCP_Improvement

The completed source code of the package has been included in the following public Google folder for viewing.

https://drive.google.com/drive/folders/0B8xO3Cc0h6rIbXNDTUM1TDlMbDQ?usp=sharing

Run time analysis

The following table represents a run time analysis of postCP package done on a 2.7 GHz machine with 4GB RAM running on Ubuntu 14.04. It shows a linear time complexity; O(N)

	CPU TIME ( Time in seconds )
n (iterations )	Mean Model	Slope Model
1000	0.104	0.056
10000	0.536	0.628
100000	2.636	2.3
1000000	16.076	13.776

The mean model and the slope model are given below.

Building the Package

1. Clone my repository https://github.com/malithj/postCP_Improvement

2. Navigate into the postCP folder

3. Build using R / RStudio. ( If it's RStudio use Ctrl + Shift + B )

Coding Guidelines

I referred to "Hadley Wickam R style" and "R style. An Rchaeological Commentary" for appropriate coding style guidelines because I saw that there are some issues with existing R packages because of not following best practices.

Final remarks

I would like to thank my mentors Gregory Nuel and Guillem Regaill for all the support and guidance and also bearing with me for the duration of the project. :D I would also like to thank Minh Luong (original creator of the package) for the additional documentation provide while I was trying to understand the previous implementation.

Finally thank you for GSOC for the great learning opportunity! :D

The official GSoC Project page can be found here.

Linked Blog posts

The following blog post explains my GSOC experience and the challenges that I had to face as a summary.
http://malithjayaweera.blogspot.com/2016/08/my-experience-with-gsoc-and-r.html

Conquering Raspberry Pi with Ubuntu 14.04 : Resolving issues with Partitioning

No comments : Posted by Unknown at 10:16:00 AM Labels: #Pi , #raspberryPi , #Ubuntu

Recently for a project, I used a Raspberry Pi. Although mostly it's preferred to use raspbian as an operating system, I chose to use Ubuntu because it gives access to the Ubuntu repositories as well. There is a complete guide for the installation here. However after sometime, I found that the Raspberry is utilizing only 2GB from its SD card which had the capacity of 8GB. This was quite a surprise because I formatted the SD card before installing Ubuntu on it.

I searched for a good resource to resolve this and I came across this wiki on elinux.org (http://elinux.org/RPi_Resize_Flash_Partitions). However based on their instructions using raspi-config was not an option because I was running Ubuntu 14.04.

The other option which actually worked for me was to manually resize the SD card on linux.

Before you proceed with anything, I highly suggest that you take a copy of your SD card using HDD Raw Copy Tool.

First to show partition information I used the following command.

$ df -h

Look for a partition which approximately has 2GB size of the distribution image.

$ umount /dev/mmcblk0p1

Then use parted to examine the card

$ sudo parted /dev/sdd
(parted) unit chs
(parted) print
Disk /dev/sdd: 121535,3,31
Sector size (logical/physical): 512B/512B
BIOS cylinder,head,sector geometry: 121536,4,32.  Each cylinder is 65.5kB.
Partition Table: msdos

Number  Start      End         Type     File system     Flags
 1      16,0,0     1215,3,31   primary  fat32           lba
 2      1232,0,0   26671,3,31  primary  ext4
 3      26688,0,0  29743,3,31  primary  linux-swap(v1)

Notice that nothing uses the card from the end of the cylinder 29743 to 121535 (which is the maximum). However in my case there wasn't a third partition called the swap. There were only two partitions; namely the boot partition and the root partition. I left the boot partition alone. I wanted to fill the SD card with the root partition. Nevertheless if there was a swap, then it would have to be moved to the end of the card. (You'll have to adjust the numbers so that the end of the partition 3 is at the end cylinder/head/sector of the card)

Here's the calculation.

Maximum - (Partition 3 End - Partition 3 Start) ) - 1 = Partition 3 New Start

So, in this example:

(121535 - ( 29743 - 26688)) -1 = 118479

Then move the partition. ( Note that this wouldn't work on parted versions later than 2.4 )

(parted) move 3 118479,0,0

However if there was no swap like in my case, life becomes much simpler.

(parted) rm 2
(parted) mkpart primary 1232,0,0 121534,0,0
(parted) quit

Note that in my case, the head and sector numbers did not exactly coincide. (eg:- 3,31 at the end). Therefore as a precaution, I counted one cylinder less from the total available cylinders. ( i.e. 121534,0,0 )

The rest of the steps were easy.

Clean and resize the root partition.

$ sudo e2fsck -f /dev/mmcblk0p1

Allow it to add lost and found

$ sudo resize2fs /dev/mmcblk0p1

After the steps this was the result.

Subscribe to: Posts ( Atom )

Malith's Perspective

Sunday, August 21, 2016

Changepoint Detection : Theoretical Background

Introduction

My Experience with GSOC and R

Friday, August 19, 2016

postCP change point detection with GSOC

Introduction

Implementation

Run time analysis

Building the Package

Coding Guidelines

Final remarks

Linked Blog posts

Wednesday, August 10, 2016

Conquering Raspberry Pi with Ubuntu 14.04 : Resolving issues with Partitioning

R-bloggers

Navigation

About Me

Blog Archive

Popular Posts

Labels