For the MSR 2016 Challenge, I applied data science to test a gut feeling I had: commits with weird, unusual log messages often contain lower-quality code than commits with boring, ordinary messages.
So: Are commits that look fishy…
…actually hiding dubious code?
Is my hunch true? (Spoilers: Marginally).
Developers summarize their changes to code in commit messages. When a message seems “unusual,” however, it casts doubt on the quality of the code contained in the commit. We trained n-gram language models and used cross-entropy as an indicator of commit message “unusualness” for over 120 000 commits from open source projects. Build statuses collected from Travis-CI were used as a proxy for code quality. We then compared the distributions of failed and successful commits with regard to the “unusualness” of their commit message. Our analysis yielded significant results when correlating cross-entropy with build status.
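To make "cross-entropy as unusualness" concrete, here is a minimal sketch in Python. It is not the setup used in the study: it assumes a bigram model with add-one smoothing and whitespace tokenization, and the names `train_bigram` and `cross_entropy` are made up for illustration. The idea is only that a message whose word sequences the model has rarely seen costs more bits per token.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence boundary markers (illustrative choice)


def tokenize(message):
    # Assumption: lowercase + whitespace split; a real study would tokenize more carefully.
    return message.lower().split()


def train_bigram(messages):
    """Count bigrams and their left contexts over a corpus of commit messages."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = {EOS}
    for msg in messages:
        tokens = [BOS] + tokenize(msg) + [EOS]
        vocab.update(tokens[1:])
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def cross_entropy(message, model):
    """Average negative log2 probability per token (bits); higher means more unusual."""
    context_counts, bigram_counts, vocab = model
    v = len(vocab) + 1  # +1 reserves probability mass for unseen words
    tokens = [BOS] + tokenize(message) + [EOS]
    total_bits = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-one (Laplace) smoothing so unseen bigrams get non-zero probability.
        p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + v)
        total_bits -= math.log2(p)
    return total_bits / (len(tokens) - 1)


# Toy corpus, invented purely for this example.
corpus = [
    "fix typo in readme",
    "fix bug in parser",
    "update readme",
    "fix typo in docs",
]
model = train_bigram(corpus)
ordinary = cross_entropy("fix typo in readme", model)
unusual = cross_entropy("yolo deploy friday lol", model)
print(f"ordinary: {ordinary:.2f} bits/token, unusual: {unusual:.2f} bits/token")
```

With scores like these for every commit, the comparison described above amounts to splitting the scores by build status and checking whether the two distributions differ.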
EDIT: Added my presentation.