Logbook
Today we started the journey! Professor Rui Maranhão shared with me some links that I should read:
- Bug Prediction at Google
- The 11th Working Conference on Mining Software Repositories
- 36th International Conference on Software Engineering
I found an interesting paper too: Does Bug Prediction Support Human Developers? Findings From a Google Case Study
We were talking about the name and Professor Rui suggested that we use "Schwa", which were the first characters of the Turing Machine. Yeah, we have very creative people :)
David Lo will help us define the models of the repositories: http://scholar.google.com/citations?user=Ra4bt-oAAAAJ.
Alexandre Perez gave me more useful links to read:
- ASE: http://ase2014.org/
- FSE: http://fse22.gatech.edu/
- TSE: http://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=32
- TOSEM: http://tosem.acm.org/
Tomorrow I really need to start reading these sources and see what people are doing in Mining Software Repositories and Bug Prediction.
I have started reading "Bug Prediction at Google". They use machine learning and statistical analysis to guess if a component is buggy. The algorithm is a very cheap one: the bugs should be in the files that have the most commits related to bug fixing. The problem was that when a file was fixed, the algorithm still gave too much attention to the older commits, so a fixed file kept appearing as buggy due to the amount of bug-fixing commits. Considering this, they changed the formula of the score:
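As far as I remember from the post, the new score looks roughly like this (treat it as a reconstruction, not a quote):

$$\mathrm{score}(f) = \sum_{i} \frac{1}{1 + e^{-12\,t_i + 12}}$$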
t_i is the timestamp of the bug-fixing commit, normalized from 0 (project start) to 1 (now). One could simply ask "why don't we consider the ratio of non-fixing commits to bug-fixing commits?" Well, they tried this but found the results unsatisfying.
So I just wanted to start playing, and I implemented the Google algorithm to score files. I created the core package to represent the Repository model (this model is temporary). Since the code needs to scale, I used maps to access files instead of looping to find them: "main.java" -> File instance.
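A rough sketch of what that temporary model looks like (class and field names here are illustrative, not the actual code):

```python
class File:
    def __init__(self, path):
        self.path = path
        self.bug_fix_timestamps = []   # normalized timestamps of bug-fixing commits
        self.score = 0.0


class Repository:
    def __init__(self):
        self.files = {}                # "main.java" -> File instance, O(1) lookup

    def get_file(self, path):
        return self.files.setdefault(path, File(path))
```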
While I was driving home I was thinking about other metrics we can use. For instance, when people code at night, the next morning the code that they thought was perfect and amazing usually doesn't work. If a commit changed a lot of code, as if the programmer coded too fast, it may be faulty.
Today I started the morning reading the degree project "Mining Git Repositories: An introduction to repository mining". It was very useful to read how mining Git is being done; the author created a tool called Doris and describes the process of researching what is being done, some benchmarks and how the tool is used. I reached these conclusions:
- For large repositories we need a lot of disk space and hours to clone. Git's own project on GitHub took 38 hours to clone. We need to have a server close to GitHub's servers to get more network speed;
- We need to find a very efficient way to store the metadata of the repository. The author used XML because XPath can be used to retrieve data;
- There are some bugs that may appear when we are reading the commits, such as character encoding issues;
- Research on mining Git repositories is very scarce because a lot of papers focus more on CVS;
- "Git repositories are not discussed much in academic works."
- Some tools for mining Git are hard to use because they have dependencies and poor documentation: Kenyon, APFEL, Evoizer, git_mining_tools, Shrimp, Gitdm.
The good news is that we are making a completely new product (yay :D) and the bad news is that we may have problems finding research on mining Git repositories (challenge accepted).
Today I started reading "Automatic Mining of Source Code Repositories to Improve Bug Finding Techniques" from 2005. I didn't finish my reading, but this paper is very interesting because it talks about how we can relate code changes to commits and then do bug prediction.
Finally, I finished my reading. The paper "Automatic Mining of Source Code Repositories to Improve Bug Finding Techniques" (2005) explains how a tool to mine software repositories was created.
The tool works like this:
- It looks at code changes and does not pay attention to commit messages, because it was hard to correlate bug reports with the corresponding code changes;
- They listed the typical kinds of bugs we see in software and produced a static analysis tool;
- The source code checker tries to find the most common types of bugs, e.g. function return values that are not checked for null, and outputs a list of warnings on the source code (see the small example after this list);
- The problem is that static analysis can produce a lot of false positives, so the tool tries to predict which warnings are most likely to be false positives, ranking them;
- The algorithm for ranking these warnings uses historical context, the list of functions that are related in a potential bug-fixing commit, and contemporary context information, i.e. looking at all the source code to see if the bug could have been killed (this may not be clear);
- They only analysed one kind of bug: not checking the return of a function;
- They said that in the future they need to extend the tool to analyse more kinds of bugs.
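To make the "unchecked return" bug concrete, here is the same idea in a tiny Python example (the paper's checker works on the projects' own source language, so this is only an illustration):

```python
def find_user(users, name):
    """Return the matching user or None when there is no match."""
    return next((u for u in users if u["name"] == name), None)

users = [{"name": "alice", "email": "alice@example.org"}]

# Buggy pattern the checker would flag: using the return value without
# checking the failure case (None here) would raise a TypeError:
#   user = find_user(users, "bob"); print(user["email"])

# Checked pattern:
user = find_user(users, "bob")
if user is not None:
    print(user["email"])
else:
    print("no such user")
```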
Conclusions
This is the first paper I have seen that actually pays attention to code changes. The conclusion I take from it is that we need to analyse how code changes between commits and see what type of bug fix it is. So just checking whether a commit is a bug fix is not enough: we need to determine what the bug is!
They needed, like us, to create a model of the CVS repository to improve the information retrieval. They used Perl and MySQL.
We can improve our tool by analysing more types of bugs in source code changes.
Today I read the paper "Does Bug Prediction Support Human Developers? Findings From a Google Case Study" to see how developers react to this kind of tool. This paper describes an effort to deploy a bug prediction tool at Google on 2 very mature software projects. They tried to answer 3 questions:
RQ1 Which bug prediction algorithm is preferred?
They ran a study by interviewing developers and showing them the results of the Fixcache and Rahman algorithms. By the results, Rahman was better because it was surfacing the most recent files. This matters because developers are afraid of working on old files.
RQ2 What are the characteristics that a bug prediction algorithm should have?
They held discussions with Googlers and found out that the algorithm should:
- Have actionable messages: the tool should give steps to fix the problem, because developers are used to working this way;
- Obvious reasoning: the tool should show a strong, visible and clear reason for the problem. Developers tend to ignore false positives;
- Bias towards the new: developers are more concerned with new files that are causing real problems.
RQ3 Do developers change their behaviour with bug prediction results?
Based on the first answers, researchers changed the Rahman algorithm, used the formula we saw in the article "Bug Prediction at Google" and called it Time-Weighted Risk (TWR). They deployed the tool for 3 months and evaluated the results by seeing how the tool reduced the time and cost of code reviews. Well, the tool didn't work well because:
- People didn't know how to fix the code (no actionable messages);
- The tool was flagging the same files again and again, and developers didn't feel motivated to make more changes;
- The tool was flagging more files for teams that didn't use Google's most used bug tracking tool.
Researchers concluded that, although the result wasn't very positive at Google, it may work better in other companies with different development processes, and that developers might be the wrong audience for bug prediction tools, so software quality staff would be more interested in these tools.
Conclusions:
- It's important not only to flag files but to explain why and suggest actions to users;
- It should be easy to integrate bug prediction tools with the tools that the company is already using;
- If we show developers flagged files that don't belong to them or are very old, they are afraid of making changes.
Well, I haven't updated this logbook since I started writing the first paper about this thesis... In the last few days I have started the PDIS final report, where the goal is to talk about related work.
Today I wrote about the TWR model in the Related Work chapter and found that I should also talk about Fixcache, because it looks at source code and may solve some of the problems of the TWR model. Tomorrow I will write the section about mining Git repositories. Found an interesting article about writing as a native English speaker: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3935133/
#12th January 2015 I was thinking about the related work chapter and reorganized it into Reliability Models, Repository Mining, Similar Tools and Developers Behaviour. Still not satisfied with what I am going to write, because there are papers I read that don't fit here.
#14th January 2015 Started reading about Bugcache in the article "Predicting Faults from Cached History". This helped me learn more about Fixcache too. One discussion presented in this article is the predictive power of cache-based solutions.
#29th January 2015 Started developing the presentation for PDIS today. Yesterday I had my last exam, so I am now full time on this. In research and development, at such an early stage it's hard to find a good architecture and plan for the project, since a lot of things can go wrong: models that are good in theory but bad in practice.
I was thinking about a good way of combining TWR and Fixcache: TWR uses more concrete data while Fixcache relies on assumptions, so maybe TWR can be used as a trust measure on the Fixcache cache.
#30th January 2015 Finished the final presentation for PDIS. Added the Plan and the Layered Architecture Diagram. Happy to see things rolling and excited to start coding the fault model. But first, I need to finish the PDIS report with the Related Work.
#2nd February 2015 Today I gave the final presentation of the PDIS course. I am very happy: professors enjoyed my presentation and are very curious about the results of my thesis. They gave some useful feedback and suggestions:
- Ademar Aguiar: the analysis of developer behaviour can be important to estimate failures, and it's important to speak in detail about Crowbar since people may not know it.
- João Cardoso: suggested exploring the mining method, although at this phase the data is too simple to feed the defect prediction algorithms.
- Pascoal Faria: suggested using the history of component failures to feed the prediction algorithms. It's actually a very good idea since, in the literature, Mining Software Repositories is not just about the repository itself but about combining information from the issue tracker. It's not an approach I should focus on now, but it can be used to improve the diagnostic quality.
Let's now write the final paper of PDIS!
The only person in the literature who talks about Change Classification is Sung Kim, and today I found a really good presentation called "Dealing with Noise in Defect Prediction" that explains how Change Classification works and how we can evaluate the performance of prediction models (slide 16).
An article, "Classifying Software Changes: Clean or Buggy?", discusses approaches to Change Classification and explains it in detail.
#18th February 2015 In the last weeks I have been writing the PDIS report and delivered it on 15th February. After resting 2 days for Carnival, today I am starting the Dissertation for real.
#26th February 2015 (Wednesday) Since the 18th I have implemented an experimental version of Schwa that analyzes any repository and outputs metrics at the file granularity. Last Tuesday I had a meeting where I received some feedback about the report, which is essentially:
- I need to structure the introduction better;
- Use shorter sentences;
- Talk about Barinel in related work;
- Reorganize the related work;
- Make the goals, questions and hypotheses clear.
Regarding the tool, since Tuesday I have been working on a technique for extracting differences at method granularity from Java files in every commit. After thinking and thinking I have come up with a simple algorithm (brief overview):
```python
# parse_file returns a mapping: method name -> method source code.
methods_a = parse_file(file_a)
methods_b = parse_file(file_b)

removed_methods = set(methods_a) - set(methods_b)
added_methods = set(methods_b) - set(methods_a)
modified_methods = set()

# For every method that exists in both file a and file b:
for method in set(methods_a) & set(methods_b):
    # Problem1: how to extract the portions of code to compare?
    if code_is_different(methods_a[method], methods_b[method]):
        modified_methods.add(method)
```
Problem1: I am developing a way of extracting only the functions' code from Java by defining a grammar using pyparsing. Then I simply use difflib to check whether there are differences.
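A minimal sketch of the difflib part, assuming the pyparsing grammar already gives me each method's source as a string (code_is_different is the helper from the pseudocode above):

```python
import difflib


def code_is_different(method_source_a, method_source_b):
    """Return True if the two method bodies differ."""
    diff = difflib.unified_diff(method_source_a.splitlines(),
                                method_source_b.splitlines(),
                                lineterm="")
    # Any +/- line that is not a file header means the bodies changed.
    return any(line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
               for line in diff)
```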
Since we parse at every commit, the extraction will get slower, but that is something we cannot do much about.
#2nd March 2015 Today I came up with a simple and efficient way of representing diffs at any granularity. I created the classes Diff, DiffFile, DiffClass and DiffMethod. For each type of diff, we activate the most convenient flag: renamed, modified, added or removed.
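A sketch of the idea (field names here are illustrative, not necessarily the ones in the code):

```python
class Diff:
    """Base diff: exactly one of the flags is expected to be set."""
    def __init__(self, added=False, removed=False, modified=False, renamed=False):
        self.added = added
        self.removed = removed
        self.modified = modified
        self.renamed = renamed


class DiffFile(Diff):
    def __init__(self, file_a=None, file_b=None, **flags):
        super().__init__(**flags)
        self.file_a, self.file_b = file_a, file_b


class DiffClass(Diff):
    def __init__(self, file_name, class_a=None, class_b=None, **flags):
        super().__init__(**flags)
        self.file_name = file_name
        self.class_a, self.class_b = class_a, class_b


class DiffMethod(Diff):
    def __init__(self, file_name, class_name, method_a=None, method_b=None, **flags):
        super().__init__(**flags)
        self.file_name, self.class_name = file_name, class_name
        self.method_a, self.method_b = method_a, method_b


# e.g. a method renamed inside Api.java:
d = DiffMethod("Api.java", "Api", method_a="run", method_b="start", renamed=True)
```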
I finished the first working version of my Java Parser and am now integrating it with the new diff model in the extractor. There are still some unsolved issues:
- How to handle method overloading? (I only have a partial and inefficient solution)
- How to detect renamed classes and methods? (I only have a partial and inefficient solution)
#3rd March 2015 Good news and bad news: I made method granularity possible, but because I am using a parser to compare the code of all the methods in a class, the extraction is now tremendously slow! I need to find a way of interpreting diff information and only analyzing changed code. I also changed the way Analytics are displayed, now using classes instead of dictionaries (they say it is more efficient), making the code more expressive.
Other important things to do: document the code and implement unit tests with 100% code coverage. Programming without TDD is not the same.
#4th March 2015 Today I had a meeting with Rui Abreu to talk about the status of development. Although we achieved method granularity, the Java parsing is not efficient and it takes too much time just to analyze diffs between 2 versions of a file. Rui also explained to me how Schwa is going to integrate with Crowbar: P(C) and P(C | A).
The main goals for the next weeks are:
- Find a way of analyzing diffs very fast
- Find a probability model that receives as input the metrics: F(fixes, authors, twr, revisions)
Alexandre gave me an idea about how to guess the component from a line number: annotate the AST with line numbers.
#5th March 2015 Today I solved two things that seemed kind of impossible: I created and implemented a technique to parse methods and their line ranges (e.g. [1, 10, API.run()]) and a technique to extract sequences of diffs (e.g. (2, 30, +)). I am not supporting nested classes at the moment. Also, I am forcing myself to use TDD!
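A toy sketch of how the two pieces fit together, i.e. mapping changed line ranges back to components through the annotated line ranges (all numbers and names below are made up):

```python
# (start_line, end_line, component) produced by the parser.
method_ranges = [(1, 10, "API.run()"), (12, 30, "API.stop()")]
# (start_line, end_line, added/removed) produced by the diff extraction.
changed_ranges = [(2, 30, "+")]


def components_hit(method_ranges, changed_ranges):
    hit = set()
    for start, end, component in method_ranges:
        for c_start, c_end, _ in changed_ranges:
            if c_start <= end and c_end >= start:   # the ranges overlap
                hit.add(component)
    return hit


print(components_hit(method_ranges, changed_ranges))  # {'API.run()', 'API.stop()'}
```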
#9th March 2015 Finished JavaParser.diff() with the new changed-sequences technique and implemented the probability model. The tool now spends more time computing the defect probability of each component (45% of the time is defect computation). Since I need to understand the analytics easily, I started building the Sunburst interface today using d3js.
We are discussing the probability model with David Lo.
#10th March 2015 Changed how the defect probability is calculated: TWR for all metrics and a weighted average of their probabilities. The TWR is converted to a probability using a modified exponential function. BUT we need to train the parameters of this probability model with Machine Learning techniques. That's why I started the Stanford course on Machine Learning today. Hope to get some new ideas in the next days.
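Roughly, the idea is something like this sketch (the exact squashing function and the weights below are placeholders, not the real values):

```python
import math


def twr_to_probability(twr):
    # Squash an unbounded TWR sum into [0, 1) with an exponential.
    return 1 - math.exp(-twr)


def defect_probability(revisions_twr, fixes_twr, authors_twr,
                       w_revisions=0.4, w_fixes=0.4, w_authors=0.2):
    # Weighted average of the per-metric probabilities (weights sum to 1).
    return (w_revisions * twr_to_probability(revisions_twr)
            + w_fixes * twr_to_probability(fixes_twr)
            + w_authors * twr_to_probability(authors_twr))


print(defect_probability(revisions_twr=3.2, fixes_twr=1.1, authors_twr=0.4))
```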
#11th March 2015 Continued studying Machine Learning and wondering if Neural Networks can be the solution for P(authors, fixes, revisions). The problem is that we don't have a complete dataset, since we only know that P=100% when we have a fix.
#12th March 2015 Had a meeting with Rui and Alex. We discussed a possible way of calculating the probability for each component: use Fuzzy Clustering with revisions_twr, fixes_twr, authors_twr, is_bug as the dataset. We concluded that we have a classification problem.
#13th March 2015 Investigated Fuzzy Clustering today and got no practical results and examples. Found scikit-learn for Python and a good map for choosing the right technique: http://scikit-learn.org/stable/tutorial/machine_learning_map/ Linear SVMs are supposed to be adequate for our dataset, but after training, the results are biased: 0.73, 0.72 probability??
Wondering now if our dataset is bad? We need to use feature selection :/ I have plotted the data to see if there are some clusters and I got this nightmare: https://files.slack.com/files-pri/T028BEFR1-F0417SF1Y/dataset.png
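For reference, a minimal sketch of the kind of experiment I tried (the data here is synthetic; the real features are the per-file TWR metrics):

```python
import numpy as np
from sklearn.svm import SVC

# Toy data standing in for (revisions_twr, fixes_twr, authors_twr) -> is_bug.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 1] > 0.5).astype(int)               # pretend fixes_twr drives bugginess

clf = SVC(kernel="linear", probability=True)  # linear SVM with probability estimates
clf.fit(X, y)
print(clf.predict_proba(X[:5]))               # per-file defect probabilities
```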
#16th March 2015 Today I was focused on writing unit tests for the Git Extractor and making it run smoothly on Travis CI. Regarding the defect probability model, I believe that statistical correlation is the best way forward. Found that sometimes there is a correlation between fixes and authors and other times not...
#17th March 2015 Today I analyzed the correlation between Authors and Fixes, and between Revisions and Fixes, using Pearson. https://docs.google.com/a/gcloud.fe.up.pt/spreadsheets/d/1-08Me7mV7a8AM6ywLZN0BZ0LcBEDTDYmEO6pFwzgtEc/edit?usp=drive_web
Found that Fixes and Revisions have a stronger correlation, but I am still in doubt about causality and whether there is a lurking variable missing. Still a long way from having a probability model based on a classifier... The dataset (revisions_twr, fixes_twr, authors_twr, have_bug) isn't enough!
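The correlation itself is just Pearson's r over per-file counts, something like this (the numbers are made up):

```python
from scipy.stats import pearsonr

# Illustrative per-file counts of revisions and fixes.
revisions = [10, 4, 25, 7, 13, 2]
fixes = [3, 1, 9, 2, 5, 0]

r, p_value = pearsonr(revisions, fixes)
print("Pearson r = %.2f, p = %.3f" % (r, p_value))
```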
#18th March 2015 Implemented the experimental defect probability model with the weighted average of the Revisions, Authors and Fixes TWRs. I have also written unit tests for the Schwa Analysis and fixed the anonymous classes bug.
#19th March 2015 Results are now represented in a Sunburst chart. It is now very easy to evaluate the results and improve the defect prediction module. I also configured the project as a Python CLI program so it can be easily distributed.
#20th March 2015 Commented the Python code following PSF and Google code conventions.
#23-27th March 2015 This week was abnormal and things didn't go well, but we are always learning. I started the Crowbar integration, but first I took 5 days to be able to build it on my Mac due to a bug in Java's Scanner with locales. I have learned important things this week:
- Software documentation is important and must be a habit. We never know when someone is going to need to use our code;
- Understand the kind of involvement developers have in a project, and make sure that when they are helping you they are doing it because they want to and not because they feel obligated;
- If we are integrating our software project with existing code, make sure we understand that code as soon as the project starts.
I have improved Schwa's documentation and I will force myself to create the habit of updating it on a weekly basis. Next week I am on Easter vacation, without computers, to clear my mind and come back to make amazing things.
#7-10th April 2015 This week I finally found a way of solving the parsing problem. I studied how Plyj works and modified it to annotate the line ranges of classes and methods. It is working! I am currently propagating these changes to the other layers: Extraction, Parsing, Analysis.
I had a meeting with Rui Maranhão and the work plan is: Parser, Integrate with Crowbar and Machine Learning.
I had a health problem this week and visited 2 companies, so I didn't dedicate the usual time: next week will be better.
#13-17th April 2015 Finished the Java Parser this week and had a meeting with Rui and Alex. Next goals are:
- Integrate with crowbar
- Change feature weights and see the impact on Fault Localization
- Evaluate feature weights by testing with a lot of repositories
- Use churn as a feature
#20th April 2015
- Improved bug fixing detection
- Worked on distribution with PyPI. Don't use cx_freeze: it sucks and is buggy
#21st April 2015
- Fixed a Windows bug with multiprocessing: disabled it because Windows does not fork()
- Started integration with Crowbar. Schwa analysis should be on the agent in premain
#27-30th April 2015
- Integrated Schwa with Crowbar as a CMD driver
#4-8th May 2015
- Refactored some code in Crowbar
- Defined the experiments for the next weeks
- Deadline for experiments: 22 May
- Deadline for report: 12 June
- Started implementing the Feature Weight Learner
I have finished the Feature Weight Learner. I exposed BITS_PRECISION, POPULATION and GENERATIONS as parameters to tune for better results. For the sake of performance I am using file granularity. The results are the same as when we tried Support Vector Machines and plotted the fixes TWRs in a scatter: the feature weights look random.
To output learning results, use the learning option "-l":
schwa libcrowbar -l --commits 2
{'fixes': 0.14285714285714285, 'authors': 0.8571428571428571, 'revisions': 0.0}
schwa libcrowbar -l --commits 2
{'revisions': 0.5714285714285714, 'authors': 0.0, 'fixes': 0.42857142857142855}
The learner may be stuck in a local optimum, GAs may not be suited for this problem, or we simply cannot generalize feature weights the way we expected.
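For context, a minimal sketch of the kind of genetic algorithm involved (pure Python, with a toy fitness function; the real learner evaluates candidate weights against the repository's bug-fixing history, and the constants play the roles of POPULATION and GENERATIONS above):

```python
import random

POPULATION, GENERATIONS, MUTATION_RATE = 50, 100, 0.1
FEATURES = ["revisions", "fixes", "authors"]


def normalize(w):
    s = sum(w)
    return [x / s for x in w]           # constraint: weights sum to 1


def random_weights():
    return normalize([random.random() for _ in FEATURES])


def fitness(weights):
    # Toy objective: in the real learner this would score how well the
    # weighted TWRs separate buggy components from clean ones.
    target = [0.4, 0.4, 0.2]
    return -sum((a - b) ** 2 for a, b in zip(weights, target))


def crossover(a, b):
    return normalize([(x + y) / 2 for x, y in zip(a, b)])


def mutate(w):
    if random.random() < MUTATION_RATE:
        w = list(w)
        w[random.randrange(len(w))] = random.random()
        w = normalize(w)
    return w


population = [random_weights() for _ in range(POPULATION)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[:POPULATION // 2]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POPULATION - len(parents))]
    population = parents + children

print(dict(zip(FEATURES, max(population, key=fitness))))
```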
- Improved Feature Weight Learner
- Built a presentation about Schwa
- Added constraints to the Feature Weight Learner and now I am having better results.
- Revisions is the feature with the most weight but sometimes Fixes is the one. Maybe different scenarios?
- Threats to validity in the Feature Weight Learner: repos with only one contributor
- Implemented Diagnostic Cost in Crowbar