Dylan Huang | HQTrivia Solver

Motivation

During the winter break of 2017, I decided to work on a project that would make me learning something new. Inspired by this article, I decided to make my own trivia solver based on the trending HQTrivia app. In the following section, I am going to outline the approach that I designed and implemented.

High-level approach

If you have not played HQTrivia, the idea is to answer a multiple choice trivia question in under 10 seconds. If you can answer 12 questions in a row, you win a cash prize!

getting the data

The first thing that I needed to do was to find out how to grab the HQTrivia data. A lot of people used OCR along with hooking up their phone to their computer. But I figured there had to be a better way. After some research, I found someone who had sniffed the packets coming from their phone while the HQTrivia app was open and found out that HQTrivia serves their questions through a websocket connection (I had to learn how to use websockets). After doing the same thing, I was able to discover the schema and HQTrivia’s schema for questions. To do the packet sniffing, I used a web debugging proxy called Charles and proxied my phone’s network through my computer.

sniff sniff

choosing the model

The next thing I knew was that I wanted to use some sort of machine learning model to solve the questions (lol buzz words), but I had no previous experience using a model to solve anything. The first thing I did was learn about what exactly a “model” is, what models currently exist, and what the hell a feature is.

After learning about my options for a model, I decided on using a decision tree with gradient boosting from the LightGBM by Microsoft because it was proven to be better than xgboost, but xgboost has proven to be extremely effective in machine learning competitions that require quick learning, classification, or learning on small amounts of data (due to gradient boosting).

creating the features

The next step was for me to create features. Below is a list of features that I designed using Google and Bing searches:

Google search each answer and count number of occurrences for each evaluated important word
Bing search each answer and count number of occurrences for each evaluated important word
Google question and each answer, create document corpus from each answer search, compare cosine similarity of question search to corpus
Google search each important word and count number of occurrences for each answer
Bing search each important word and count number of occurrences for each answer
Google question with each answer appended and count number of search results
Google question with each answer appended and count number of search results
Google important words with each answer appended and count number of search results
Google noun chunks with each answer appended and count number of search results
Wikipedia search each answer and count number of occurrences for each evaluated important word
Google question with each answer appended in quotes and count occurrences of each answer
Bing search question with each answer appended in quotes and count occurrences of each answer
Google question with each answer appended in quotes and count occurrences of each important word
Google evaluated noun chunks of question and count occurrences of each answer
Google the question and count occurrences of each answer

Annoying thing

Since google rate limits your searches, I had to use a VPN to change every 3 questions during a game because I was doing upwards of 40 google searches for each question, depending on the evaluation of the question. Also, since network requests are relatively slow to CPU computation, I had to implement an immense amount of multithreading to deal with the massive amount of network I/O that I was executing for each question.

What I learned

Gradient Boosting and LightGBM
NLP with spaCy
Machine Learning Features
Websockets
TF-IDF / Cosine Similarity
Multithreading network requests
Lots of Python

Conclusion

After months of testing and gathering data to train my supervised model, I was able to get about a ~80% accuracy rate (which is not that great if you think about the chance of getting 12 right in a row .8 ^ 12 = .0687) and at an average time of ~7 seconds per questions. I gathered a total of 629 questions. And I also won a game worth $53.19 dollars (note that its not my name because it was 3 of us and we decided to split on the last question to guarantee a win)!

Yay!

All the gathered questions, model, and source code can be found here :)!