Bao's Research Page

Selected Papers

Named Entity Recognition for Vietnamese Real Estate Advertisements
Son Huynh, Khiem Le, Nhi Dang, Bao Le, Dang Huynh, Binh T. Nguyen, Trung T. Nguyen, Nhi Y. T. Ho
The 8th NAFOSTED Conference on Information and Computer Science (NICS), Accepted, 2021.

With the booming development of the Internet and e-commerce, advertising has appeared in almost all areas of life, especially in the real estate domain. Understanding these advertising posts is necessary to capture the status of real estate transactions and the rent and sale prices of various properties in different areas. Motivated by that, we present the first manually annotated Vietnamese dataset in the real estate domain. Our dataset is annotated for the named entity recognition (NER) task with many entity types and, in comparison to other Vietnamese NER datasets, contains the largest number of entities. We empirically investigate a strong baseline on our dataset using the API provided by the spaCy library, which comprises four main components: tokenization, embedding, encoding, and parsing. For the encoding, we conduct experiments with various encoders, including convolutions with Maxout activation (MaxoutWindowEncoder), convolutions with Mish activation (MishWindowEncoder), and bidirectional long short-term memory (BiLSTMEncoder). The experimental results show that the MishWindowEncoder gives the best performance in terms of micro F1-score (90.72%). Finally, we aim to publish our dataset later to contribute to the research community working on named entity recognition.

A New Approach for Vietnamese Aspect-Based Sentiment Analysis
Bao H. Le, Hoang Minh Nguyen, Nhi Kieu-Phuong Nguyen, Binh Nguyen
The 14th International Conference on Knowledge and Systems Engineering (KSE), Accepted, 2022.

Intelligent systems, especially smartphones, have become crucial parts of the world. These devices can handle various human tasks, from long-distance communication to healthcare assistance. Behind this tremendous success, customer feedback on a smartphone plays an integral role during the development process. This paper presents an improved approach for the Vietnamese Smartphone Feedback Dataset (UIT-ViSFD), which was carefully collected and annotated in 2021 and includes 11,122 comments and their labels. The approach employs the pretrained PhoBERT model with a proper pre-processing method. In the experiments, we compare the approach with other transformer-based models such as XLM-R, DistilBERT, RoBERTa, and BERT. The experimental results show that the proposed method outperforms the state-of-the-art methods on the UIT-ViSFD corpus. Our model achieves better macro-F1 scores for the Aspect and Sentiment Detection tasks, which are 86.03% and 78.76%, respectively. In addition, our approach could improve results on Aspect-Based Sentiment Analysis datasets in the Vietnamese language.

OphNER: Named Entity Recognition for Ophthalmology Newspapers
Bao H. Le, Thi Quynh Pham, Anh Thi Van Hoang, Binh Nguyen
The 15th International Conference on Knowledge and Systems Engineering (KSE), Accepted, 2023.

The Fourth Industrial Revolution has turned electronic devices into the main tools of daily human life. At the same time, the consequences of a developed economy, such as environmental pollution, viruses, and bacterial strains, are also main causes of eye diseases. With the development of media, information about eye diseases has been quickly and thoroughly compiled in different types of texts. Texts on ophthalmology provide information about symptoms, disease manifestations, agents, prevention, and treatment in great detail. This large amount of information makes it more difficult to find and filter. Consequently, it has become more urgent to build both a corpus and a system to identify and categorize information from official sources so that everyone can easily find relevant information and better understand terms related to ophthalmology. One technique for finding information related to keywords is named entity recognition (NER). To help address this problem, we release the OphNER (Ophthalmology NER) dataset, the first corpus of its kind, containing nearly 9,000 sentences with more than 17,447 labels in total across 16 entity types. We also conduct experiments with state-of-the-art models. The highest result belongs to RoBERTa_large, which performs better than XLM-R_large and XLNet_large. Our dataset is released on GitHub.

Hoang-Bao Le (Bảo)
