The use of artificial intelligence in financial markets has been growing annually, driven by the capacity of these algorithms to support decision-making, trend prediction, and sentiment analysis. In the context of sentiment analysis, this research aimed to create an efficient pipeline for preprocessing tweets about stocks traded on the Brazilian stock exchange, preparing them for a subsequent stage to be developed in the master's research of co-author Leandro Araújo (Language model for the Brazilian stock market: An approach based on sentiment analysis using the BERTimbau model).
To test and analyze natural language preprocessing techniques on Portuguese-language tweets about the Brazilian financial market.
1-Lowercase. 2-Remove spam (example: “I beg for help”). 3-Remove URLs. 4-Remove RT (retweet markers). 5-Remove emails. 6-Remove mentions (@). 7-Remove hashtags. 8-Remove line breaks (\n). 9-Remove numbers. 10-Remove unnecessary symbols. 11-Remove tweets that are not in Brazilian Portuguese. 12-Normalizer. 13-Remove accents. 14-Remove stopwords. 15-Remove swear words. 16-Stemming. 17-Tokenization, TF-IDF and Bag of Words.
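
The rule-based steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the pipeline used in the study: the function names (preprocess, strip_accents), the regular expressions, and the tiny stopword list are all assumptions made for the example. Steps that require external resources or models (spam filtering, language detection, the normalizer, swear-word removal, stemming, TF-IDF and Bag of Words) are left out.

```python
import re
import unicodedata

# Tiny illustrative stopword list (an assumption; a real pipeline would
# use a much fuller Portuguese list). Entries are unaccented because
# stopwords are removed after accents are stripped.
STOPWORDS = {"a", "o", "as", "os", "de", "da", "do", "e", "em",
             "que", "para", "com", "um", "uma"}

# Hypothetical patterns, one per rule-based step.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
RT_RE = re.compile(r"\brt\b")          # applied after lowercasing
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
NUMBER_RE = re.compile(r"\d+")
SYMBOL_RE = re.compile(r"[^\w\s]|_")   # anything that is not a letter, digit or space


def strip_accents(text):
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def preprocess(tweet):
    """Apply the rule-based cleaning steps and return a list of tokens."""
    text = tweet.lower()                # 1: lowercase
    text = URL_RE.sub(" ", text)        # 3: URLs
    text = EMAIL_RE.sub(" ", text)      # 5: emails (before mentions, which also match "@")
    text = RT_RE.sub(" ", text)         # 4: retweet marker
    text = MENTION_RE.sub(" ", text)    # 6: mentions
    text = HASHTAG_RE.sub(" ", text)    # 7: hashtags
    text = text.replace("\n", " ")      # 8: line breaks
    text = NUMBER_RE.sub(" ", text)     # 9: numbers
    text = SYMBOL_RE.sub(" ", text)     # 10: remaining symbols
    text = strip_accents(text)          # 13: accents
    tokens = text.split()               # 17: whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]  # 14: stopwords


if __name__ == "__main__":
    sample = "RT @fulano: PETR4 subiu 3% hoje!\nVeja em https://example.com #bolsa"
    print(preprocess(sample))
```

Note that removing numbers also truncates stock tickers such as PETR4 to "petr"; whether that is desirable depends on the downstream model, so in practice this step may need a ticker-aware exception.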