This notebook is a continuation of the previous notebook on clustering, where we clustered related Q&A blog posts for a recommendation engine. The next step is building a classifier on this type of data. A challenge for Q&A site admins is maintaining a decent level of quality in post content: higher-quality posts ultimately attract more users.
One way to encourage quality content is to allow the question asker to flag one answer to their question as the accepted answer (this is how stackoverflow works). This results in more score points for both the asker and the answerer.
What if, instead, the website continuously evaluated answer submissions in progress and provided feedback on whether an answer shows signs of being inadequate, based on signals such as code output, images, and word count?
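As a rough sketch of what such signals might look like, the function below extracts a few simple quality features from an answer's HTML body. The feature names and the tag-based heuristics are my own illustrative assumptions, not part of the original notebook:

```python
import re

def basic_answer_features(html_text):
    """Extract simple quality signals from an answer's HTML body.

    Note: these feature names and heuristics are illustrative assumptions,
    not the notebook's actual feature set.
    """
    return {
        # Code blocks in rendered posts typically appear inside <pre>/<code>
        "has_code": "<pre>" in html_text or "<code>" in html_text,
        "has_image": "<img" in html_text,
        # Crudely strip tags before counting words
        "word_count": len(re.sub(r"<[^>]+>", " ", html_text).split()),
    }

features = basic_answer_features("<p>Use <code>len(x)</code> instead.</p>")
```

In a real pipeline, features like these would become columns in the matrix fed to the classifier.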
To achieve this we will need to tune the data and the classifier extensively, and I will walk through at least the basic structure of that process in this notebook.
The team behind stackoverflow provides much of its data under a CC Wiki license; the latest data dump can be found here. The dataset we need is posts.xml, which contains the question and answer post content.
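To get a feel for the file, here is a minimal sketch of streaming it with `xml.etree.ElementTree.iterparse`, which avoids loading the multi-gigabyte dump into memory at once. The inline sample mimics the dump's `<row>` format; the attribute names (`PostTypeId`, `ParentId`, `AcceptedAnswerId`) follow the Stack Exchange schema, but treat the exact layout as an assumption to verify against the real file:

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Tiny inline sample mimicking the posts.xml row format
# (PostTypeId 1 = question, 2 = answer in the Stack Exchange dump).
sample = b"""<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="2" Score="5" Body="How do I parse XML?" />
  <row Id="2" PostTypeId="2" ParentId="1" Score="3" Body="Use ElementTree." />
</posts>"""

questions, answers = {}, []
# iterparse streams the document element by element
for _, elem in ET.iterparse(BytesIO(sample), events=("end",)):
    if elem.tag == "row":
        if elem.get("PostTypeId") == "1":
            questions[elem.get("Id")] = dict(elem.attrib)  # copy before clearing
        elif elem.get("PostTypeId") == "2":
            answers.append(dict(elem.attrib))
        elem.clear()  # free memory as we go, important for the real dump
```

Each answer's `ParentId` links it back to its question, which is how we will pair answers with the question they address.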