The core objective of this study was to automate the process of extracting messages (stock recommendations or trading tips) from a large set of exchanges within a WhatsApp group, formed exclusively to discuss and share stock market related information.
Parsing the extracted messages to find specifics of the recommendations and pulling data for the stocks to analyze performance were integral steps for the automation process to be more holistic and useful.
The flow chart that illustrates the overall approach adopted for the study
- WhatsApp group archive for the base/foundation data
- Quandl.com to compare the stock recommendations extracted from the WhatsApp data to stock prices and other related details
he raw data was processed to extract relevant Messages and Timestamps and thereafter the extracted data was further treated to remove
- Null values
- Stop words
- Emoticons and
- Unknown characters
3986 messages were labelled as recommendation / general and finally considered for further study.
To ensure that the key words needed for the study were available in high frequencies in the treated dataset, visualization was explored.
Words with higher frequency (in the word cloud) have been used to filter the dataset.
- Supervised approach
- Term Frequency Inverse Document Frequency (TFIDF) vectorizer to form the text vectors
- Multinomial Naive Bayes (MNB) to classify messages as
- Recommendation for stocks &
- General for other messages
The stated model had 96% accuracy rate.
The model classified 350 messages as Recommendation and the rest as General.
The required entities from the 350 Recommendation messages were extracted using Named Entity Recognition (NER). Stanford NER was used to get the following four entities
- Company name
- Target price
- Stop loss
- Period of recommendation
The stated model had 94% accuracy rate
The 350 stock messages (Recommendation) with identified entities were considered as Stock Recommendations against the NSE stock data (taken from quandl.com).
Note: The company names that were obtained from the NER were converted to its related
NSE ticker to get data from quandl.com.
is attached herewith and the returns are calculated using Arithmetic Average
Two methods of conversions were used to convert company name to NSE ticker
- Cosine distance
- Levenshtein/Edit distance The NSE tickers were then used to get the data from quandl.com via their API.
Disclaimer: The intent of the study was not to claim the accuracy of the stock recommendations from the results but rather to identify the methodologies to derive meaningful insights.