Sunday, August 21, 2016
Changepoint Detection: Theoretical Background
Introduction
Changepoints are abrupt variations in the generative parameters of sequential data. Changepoint detection is the process of identifying such abrupt variations, and it has proved useful in a wide range of applications such as EEG analysis, DNA segmentation and econometrics.
Bayesian approaches to changepoint detection focus on segmentation and on techniques for generating samples from the posterior distribution over changepoint locations.
Let's try to understand this without a lot of complex mathematical notation. When we are given a data set that varies with time (e.g. a time series), there might be sudden changes. Because of these changes, the data set can be partitioned and each segment analysed separately. Therefore, if we assume that within a partition there is a relationship between the data points, we can perform a regression to calculate the regression coefficients (provided that we have an idea about the changepoint locations).
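As a toy illustration of that idea (not postCP code, and assuming a single known changepoint), each segment can simply be fitted with its own linear regression in R:

# Toy example: one known changepoint at position 50; each segment is fitted
# with its own regression of the observations on the time index.
set.seed(42)
y <- c(rnorm(50, mean = 1), rnorm(50, mean = 4))   # series with one mean shift
segment <- rep(c(1, 2), each = 50)                 # known partition of the series
coef_by_segment <- lapply(split(seq_along(y), segment), function(idx) {
  coef(lm(y[idx] ~ idx))                           # per-segment regression coefficients
})
coef_by_segment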
However, this does not capture the uncertainty involved with the changepoints. This is where changepoint distributions come into play. Following the Bayesian changepoint model, and by using a constrained HMM (Hidden Markov Model), we are able to calculate the posterior distribution.
The posterior distribution can therefore be calculated with a modified forward-backward algorithm. The algorithm can be further improved to compute the most probable changepoint configuration.
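To make the idea concrete, here is a minimal R sketch of such a constrained forward-backward pass. It assumes the per-point log evidences are already available as an n x K matrix (one column per segment, K >= 2), uses illustrative names, and is not the postCP implementation itself.

# cp_posterior: posterior changepoint probabilities from a constrained HMM.
# loge[i, k] = log p(y_i | parameters of segment k); the hidden states are the
# segment indices 1..K, constrained to start at 1, end at K and increase by at most 1.
cp_posterior <- function(loge) {
  n <- nrow(loge); K <- ncol(loge)
  logsum <- function(a, b) {               # log(exp(a) + exp(b)), safe when both are -Inf
    m <- max(a, b)
    if (m == -Inf) return(-Inf)
    m + log(exp(a - m) + exp(b - m))
  }
  la <- matrix(-Inf, n, K); lb <- matrix(-Inf, n, K)
  la[1, 1] <- loge[1, 1]                   # constraint: point 1 belongs to segment 1
  for (i in 2:n) {
    la[i, 1] <- la[i - 1, 1] + loge[i, 1]
    for (k in 2:K)                         # stay in segment k, or enter it from k - 1
      la[i, k] <- logsum(la[i - 1, k], la[i - 1, k - 1]) + loge[i, k]
  }
  lb[n, K] <- 0                            # constraint: point n belongs to segment K
  for (i in (n - 1):1) {
    lb[i, K] <- lb[i + 1, K] + loge[i + 1, K]
    for (k in (K - 1):1)
      lb[i, k] <- logsum(lb[i + 1, k] + loge[i + 1, k],
                         lb[i + 1, k + 1] + loge[i + 1, k + 1])
  }
  logZ <- la[n, K]                         # total log evidence of the segmentation model
  cp <- matrix(0, n - 1, K - 1)            # cp[i, k]: P(k-th changepoint lies between i and i + 1)
  for (i in 1:(n - 1)) for (k in 1:(K - 1))
    cp[i, k] <- exp(la[i, k] + loge[i + 1, k + 1] + lb[i + 1, k + 1] - logZ)
  cp
}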
The reference below provides more insight into the postCP package.
[1] Luong, T.M., Rozenholc, Y. and Nuel, G., 2013. Fast estimation of posterior probabilities in change-point analysis through a constrained hidden Markov model. Computational Statistics & Data Analysis, 68, pp.129-140.
My Experience with GSOC and R
It all began when I started searching for a Google Summer of Code project last year (November 2015). While searching the web, I found a page that suggested a project idea. I didn't have a complete understanding of the problem, but I contacted the mentors and familiarized myself with R and the theoretical background of the package. In March I submitted a project proposal and was finally selected for GSoC. (Of course, I had to submit a test to the mentors first.)
The actual challenge began only then. My mentor, Professor Gregory Nuel, sent me a paper of his that included a detailed explanation of the changepoint model and some code examples. I must say that it took a while to understand the document completely with the different notations used. I kept bothering my mentors for further details whenever I had a question. A significant proportion of the project's complexity came from the theoretical background it required.
I started with simple steps. The model-specific implementation required choosing a few models and finalizing the calculations for the log evidence, which I was able to do with the help of my mentors. The core implementation of the forward-backward algorithm was done in C++. I had several issues interfacing it with the R code using Rcpp, which I was able to resolve by adding useDynLib() to the NAMESPACE.
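The directive in question is the single line useDynLib(postCP) in the NAMESPACE file. As a rough illustration of the .Call() interfacing style, here is a hypothetical wrapper; the symbol and function names are assumptions for the sketch, not necessarily those used in postCP.

# NAMESPACE entry that registers the package's compiled code:
#   useDynLib(postCP)

# Hypothetical R wrapper handing the log-evidence matrix to the compiled
# forward-backward routine via .Call(); "forward_backward_cpp" is an assumed
# symbol name used purely for illustration.
run_forward_backward <- function(log_evidence) {
  stopifnot(is.matrix(log_evidence), is.numeric(log_evidence))
  .Call("forward_backward_cpp", log_evidence, PACKAGE = "postCP")
}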
After the core forward-backward algorithm was interfaced, I could proceed with the core model part. Following the changepoint model, I wrote code to find the initial regression coefficients using standard R functions. The result was then used to calculate the model-specific log evidences, which were fed into the forward-backward algorithm to compute posterior probability distributions using the HMM. Finally, the posterior changepoint probability distribution could be obtained.
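As a rough sketch of that step for the Gaussian mean model only (the function name and estimation details below are illustrative assumptions, not the postCP code):

# Given provisional changepoint positions, estimate per-segment means with
# standard R functions and build the n x K log-evidence matrix that serves as
# the HMM emissions. Gaussian mean model only, for illustration.
mean_model_evidence <- function(y, breaks) {
  bounds <- c(0, breaks, length(y))               # segment boundaries (breaks strictly increasing)
  K <- length(bounds) - 1
  seg <- rep(seq_len(K), diff(bounds))            # provisional segment label of each point
  mu <- tapply(y, seg, mean)                      # per-segment means (the fitted coefficients)
  sigma <- sqrt(sum((y - mu[seg])^2) / length(y)) # pooled standard deviation
  sapply(mu, function(m) dnorm(y, mean = m, sd = sigma, log = TRUE))
}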
Using the changepoint probability distribution, the parameters and the log evidence were then updated, finishing off the core implementation part of the package.
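A hypothetical sketch of that update, reusing the two illustrative helpers above (again an assumption about the workflow, not the postCP code):

# Take the most probable location of each changepoint from the posterior,
# refresh the segment parameters and recompute the log evidence.
update_parameters <- function(y, loge) {
  cp <- cp_posterior(loge)                    # posterior changepoint distribution
  breaks <- sort(apply(cp, 2, which.max))     # most probable position of each changepoint
  loge_new <- mean_model_evidence(y, breaks)  # updated parameters feed the new log evidence
  list(breaks = breaks, log_evidence = loge_new)
}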
However, at that point I found out that I hadn't been following a proper coding style guideline, so I referred to "Hadley Wickham's R style guide" and "R style. An Rchaeological Commentary". Of course, all of this would be wasted effort if users had no way of learning how to use the functions, so I spent time improving the function documentation as well.
To provide insight into the theoretical background behind the package, I added vignettes, the long-form documentation of R packages.
Throughout the project I learned new things, and I always tried to develop the postCP package in the most R-compliant way. All the requirements have been met and the project has been successfully completed.
Here is the source folder which contains the improvements I made to the package.
https://drive.google.com/drive/folders/0B8xO3Cc0h6rIbXNDTUM1TDlMbDQ?usp=sharing
Here are the commits made throughout the project.
https://github.com/malithj/postCP_Improvement/commits/feature-glmsyntax?author=malithj
https://github.com/malithj/postCP_Improvement/commits/feature-model?author=malithj
The final package can be found in the link below.
https://github.com/malithj/postCP_Improvement
Friday, August 19, 2016
postCP change point detection with GSOC
Introduction
The project aimed at improving the postCP package and making it available on CRAN again. The snag that prevented the package from being updated was the recent requirement that .C() calls in R code use DUP=TRUE arguments, with .Call() suggested instead of .C(). The package had to be implemented in the most R-compliant way, separating the model-specific implementation from the core implementation. The core part is implemented in C++ for speed. The project page can be found here.
Implementation
To improve usability and user friendliness, a glm-syntax based "formula" and "family" specification was added. These commits are in the feature-glmsyntax branch. The model-specific implementation was also extended to four models (Gaussian, Poisson, Binomial and Gamma); the previous package included only three (Gaussian, Poisson and Binomial). These commits are found here.
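As a rough sketch of how such a formula/family front end can be parsed with standard R machinery before the data reach the core routine (the function below is illustrative and is not the actual postCP entry point):

parse_cp_call <- function(formula, family = gaussian(), data) {
  mf <- model.frame(formula, data)              # resolve the formula against 'data'
  y  <- model.response(mf)                      # response vector
  X  <- model.matrix(attr(mf, "terms"), mf)     # design matrix (intercept, slope, ...)
  if (is.character(family)) family <- get(family, mode = "function")
  if (is.function(family))  family <- family()  # accept "poisson", poisson or poisson()
  list(y = y, X = X, family = family)           # handed on to the model-specific code
}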
The model-specific part was integrated with the core part (the C++ forward-backward algorithm) to produce the required results. Following the changepoint model, a separate section was added for parameter calculations and parameter updates based on the updated log evidences. Standard error checks were included to handle incorrect user inputs as far as possible and to give meaningful error messages. Vignettes have been included to provide long-form documentation. The commits are found here.
The development has been completed as agreed in the project proposal. All the branches have been merged into the master branch, which reflects the latest package.
The completed source code of the package has been included in the following public Google Drive folder for viewing.
Run time analysis
The following table presents a run-time analysis of the postCP package, performed on a 2.7 GHz machine with 4 GB RAM running Ubuntu 14.04. The timings show a linear time complexity, O(N).
CPU time (in seconds)
n (iterations)    Mean Model    Slope Model
1000              0.104         0.056
10000             0.536         0.628
100000            2.636         2.3
1000000           16.076        13.776
The mean model and the slope model are given below.
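The original figures are not reproduced here, so as a plausible formulation (a standard piecewise-constant mean and piecewise-linear trend, stated as an assumption rather than taken from the package documentation):

Mean model:   y_i = \mu_k + \varepsilon_i,                 \varepsilon_i \sim \mathcal{N}(0, \sigma^2),  i in segment k
Slope model:  y_i = \alpha_k + \beta_k \, i + \varepsilon_i,  \varepsilon_i \sim \mathcal{N}(0, \sigma^2),  i in segment k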
Building the Package
1. Clone my repository https://github.com/malithj/postCP_Improvement
2. Navigate into the postCP folder
3. Build using R / RStudio (in RStudio, use Ctrl + Shift + B). A devtools-based alternative is sketched below.
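Assuming the devtools package is available (an assumption; the steps above are what I actually used), the package can also be built and installed from the cloned folder in R:

# install.packages("devtools")      # once, if devtools is not already installed
devtools::install("postCP")         # path to the postCP folder inside the clone
devtools::build("postCP")           # optionally, also produce a source tarball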
Coding Guidelines
I referred to "Hadley Wickham's R style guide" and "R style. An Rchaeological Commentary" for appropriate coding style guidelines, since I had seen issues in existing R packages caused by not following best practices.
Final remarks
I would like to thank my mentors Gregory Nuel and Guillem Rigaill for all the support and guidance, and for bearing with me for the duration of the project. :D I would also like to thank Minh Luong (the original creator of the package) for the additional documentation provided while I was trying to understand the previous implementation.
Finally, thank you to GSoC for the great learning opportunity! :D
The official GSoC Project page can be found here.
Linked Blog posts
The following blog post summarizes my GSoC experience and the challenges I faced.
http://malithjayaweera.blogspot.com/2016/08/my-experience-with-gsoc-and-r.html
Wednesday, August 10, 2016
Conquering Raspberry Pi with Ubuntu 14.04: Resolving issues with Partitioning
Recently, for a project, I used a Raspberry Pi. Although Raspbian is the usual choice of operating system, I chose Ubuntu because it also gives access to the Ubuntu repositories. There is a complete guide to the installation here. However, after some time I found that the Raspberry Pi was using only 2 GB of its 8 GB SD card. This was quite a surprise because I had formatted the SD card before installing Ubuntu on it.
I searched for a good resource to resolve this and came across this wiki on elinux.org (http://elinux.org/RPi_Resize_Flash_Partitions). However, based on their instructions, using raspi-config was not an option because I was running Ubuntu 14.04.
The other option, which actually worked for me, was to manually resize the SD card on Linux.
Before you proceed with anything, I highly suggest that you make a copy of your SD card using HDD Raw Copy Tool.
First, to show partition information, I used the following command.
$ df -h
Look for the partition that is roughly the 2 GB size of the distribution image (the root partition) and unmount it.
$ umount /dev/mmcblk0p2
Then use parted to examine the card.
$ sudo parted /dev/sdd
(parted) unit chs
(parted) print
Disk /dev/sdd: 121535,3,31
Sector size (logical/physical): 512B/512B
BIOS cylinder,head,sector geometry: 121536,4,32. Each cylinder is 65.5kB.
Partition Table: msdos
Number  Start      End         Type     File system      Flags
 1      16,0,0     1215,3,31   primary  fat32            lba
 2      1232,0,0   26671,3,31  primary  ext4
 3      26688,0,0  29743,3,31  primary  linux-swap(v1)
Notice that nothing uses the card from the end of cylinder 29743 to 121535 (the maximum). However, in my case there was no third (swap) partition; there were only two partitions, namely the boot partition and the root partition. I left the boot partition alone and wanted the root partition to fill the SD card. Nevertheless, if there were a swap partition, it would have to be moved to the end of the card (adjusting the numbers so that the end of partition 3 falls on the last cylinder/head/sector of the card).
Here's the calculation.
Maximum - (Partition 3 End - Partition 3 Start) - 1 = Partition 3 New Start
So, in this example:
(121535 - ( 29743 - 26688)) -1 = 118479
Then move the partition. (Note that this wouldn't work on parted versions later than 2.4.)
(parted) move 3 118479,0,0
However, if there is no swap partition, as in my case, life becomes much simpler.
(parted) rm 2
(parted) mkpart primary 1232,0,0 121534,0,0
(parted) quit
Note that in my case the head and sector numbers did not exactly coincide (e.g. 3,31 at the end). Therefore, as a precaution, I counted one cylinder less than the total available cylinders (i.e. 121534,0,0).
The rest of the steps were easy.
Clean and resize the root partition (the second partition, /dev/mmcblk0p2 in my case).
$ sudo e2fsck -f /dev/mmcblk0p2
Allow it to add a lost+found directory if prompted.
$ sudo resize2fs /dev/mmcblk0p2
After these steps, the SD card showed its full capacity.