Key research themes
1. How do domain-specific string-based and ontology-based methods improve biomedical sentence similarity measurement, and how can their evaluation be made reproducible?
This research theme centers on the development and evaluation of sentence similarity methods tailored to the biomedical domain, emphasizing the importance of reproducible experimental setups. Because biomedical texts use highly specialized vocabulary, complex syntactic structures, and abundant acronyms, general-purpose sentence similarity models often underperform on them. The theme addresses three points: the integration of string-based techniques with ontology-based semantic methods; the impact of preprocessing stages and Named Entity Recognition (NER) tools on method performance; and the establishment of reproducible resources and protocols to enhance experimental rigor and comparability.
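The effect of a preprocessing stage on a string-based measure can be illustrated with a minimal sketch. It assumes a token-level Jaccard coefficient as the string-based method and a tiny hand-written acronym table (`ACRONYMS`, hypothetical) standing in for what an NER or ontology tool would supply; none of these names come from the surveyed work.

```python
import re

# Hypothetical acronym table for illustration only; a real pipeline would
# obtain expansions from an NER tool or a biomedical ontology.
ACRONYMS = {"mi": "myocardial infarction", "htn": "hypertension"}


def preprocess(sentence, expand_acronyms=True):
    """Lowercase, tokenize, and optionally expand known acronyms."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    if expand_acronyms:
        expanded = []
        for tok in tokens:
            # An acronym may expand into several tokens.
            expanded.extend(ACRONYMS.get(tok, tok).split())
        tokens = expanded
    return set(tokens)


def jaccard(a, b):
    """Jaccard coefficient between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(s1, s2, expand_acronyms=True):
    """String-based sentence similarity under a chosen preprocessing setup."""
    return jaccard(preprocess(s1, expand_acronyms),
                   preprocess(s2, expand_acronyms))


s1 = "Patient had an MI"
s2 = "Patient had a myocardial infarction"
raw_score = similarity(s1, s2, expand_acronyms=False)
expanded_score = similarity(s1, s2, expand_acronyms=True)
```

With the acronym left unexpanded the two sentences share only two of seven distinct tokens; after expansion they share four of six. Fixing and reporting such preprocessing choices is part of what makes a comparison between methods reproducible.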
2. What are the impacts of lexical, syntactic, and semantic features, combined with vector-based distributional representations, on general-domain sentence similarity?
This theme investigates the roles of lexical overlap, syntactic structure, and semantic frame alignment in modeling sentence similarity within general or cross-domain corpora. It explores how supervised machine learning can combine these heterogeneous features, how distributional semantic models such as Random Indexing and Latent Semantic Analysis can be used, and how syntactic and semantic information can be embedded directly into vector representations (e.g., via vector permutations or recursive autoencoders). The research seeks to identify the complementary strengths of these features in order to improve semantic textual similarity estimates beyond lexical matching alone.
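The idea of encoding word order through vector permutations can be sketched as follows. This is a simplified illustration, not a full Random Indexing implementation: it skips the co-occurrence training step and uses each word's sparse random index vector directly, and the dimensionality (`DIM`), sparsity (`NONZERO`), and the rotate-by-position scheme are assumptions chosen for brevity.

```python
import math
import random

DIM = 512      # assumed vector dimensionality
NONZERO = 8    # assumed number of non-zero entries per index vector


def index_vector(word):
    """Sparse ternary random vector, deterministic for a given word."""
    rng = random.Random(word)  # a string seed gives a reproducible vector
    vec = [0] * DIM
    for pos in rng.sample(range(DIM), NONZERO):
        vec[pos] = rng.choice((-1, 1))
    return vec


def permute(vec, shift):
    """Rotate the vector by `shift` positions (a simple permutation)."""
    shift %= len(vec)
    return vec[-shift:] + vec[:-shift] if shift else list(vec)


def sentence_vector(tokens, order_sensitive):
    """Sum word vectors; optionally permute each by its position first."""
    total = [0] * DIM
    for position, tok in enumerate(tokens):
        vec = index_vector(tok)
        if order_sensitive:
            vec = permute(vec, position)
        total = [t + v for t, v in zip(total, vec)]
    return total


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


a = "the dog bit the man".split()
b = "the man bit the dog".split()
bag_score = cosine(sentence_vector(a, False), sentence_vector(b, False))
order_score = cosine(sentence_vector(a, True), sentence_vector(b, True))
```

The two sentences contain the same words, so the plain sum treats them as identical, while the permuted sum separates them because "dog" and "man" occupy different positions. This is the kind of complementary signal that lexical matching alone cannot provide.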
3. How do hybrid and deep learning approaches integrating lexical relationships and sentence structure advance sentence similarity estimation?
This research focus explores hybrid methodologies that combine deep learning models (e.g., CNNs, RNNs, BERT) with lexical knowledge-based techniques (e.g., WordNet) to measure semantic similarity between sentences. It emphasizes the need to account for lexical relationships, syntactic structure, word order, and small but meaning-changing elements such as determiners and negations. The goal is to improve similarity measures by capturing compositional semantic phenomena beyond simple lexical overlap, with particular attention to datasets and tasks where paraphrase detection and nuanced semantic differences are critical.
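One way to picture how lexical relationships, word order, and negation can be combined into a single score is the sketch below. It is an illustration under stated assumptions, not any particular published method: `SYNONYMS` is a hypothetical table standing in for a lexical resource such as WordNet, the weights `delta` and `negation_penalty` are arbitrary, and the neural component is omitted (a score from a sentence encoder could be blended in as a further weighted term).

```python
import math

# Hypothetical synonym table standing in for a lexical resource such as WordNet.
SYNONYMS = {"car": "automobile", "quick": "fast"}
NEGATIONS = {"not", "no", "never"}


def normalize(sentence):
    """Lowercase, split on whitespace, and map synonyms to one form."""
    return [SYNONYMS.get(t, t) for t in sentence.lower().split()]


def lexical_sim(t1, t2):
    """Jaccard overlap of the synonym-normalized token sets."""
    s1, s2 = set(t1), set(t2)
    return len(s1 & s2) / len(s1 | s2) if (s1 | s2) else 1.0


def order_sim(t1, t2):
    """Compare first-occurrence positions of each word in both sentences."""
    vocab = sorted(set(t1) | set(t2))
    # 1-based position of each vocabulary word, 0 if the word is absent.
    r1 = [t1.index(w) + 1 if w in t1 else 0 for w in vocab]
    r2 = [t2.index(w) + 1 if w in t2 else 0 for w in vocab]
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(r1, r2)))
    total = math.sqrt(sum((x + y) ** 2 for x, y in zip(r1, r2)))
    return 1.0 - diff / total if total else 1.0


def negation_mismatch(t1, t2):
    """True when exactly one sentence carries an odd number of negations."""
    n1 = sum(t in NEGATIONS for t in t1) % 2
    n2 = sum(t in NEGATIONS for t in t2) % 2
    return n1 != n2


def hybrid_sim(s1, s2, delta=0.8, negation_penalty=0.5):
    """Weighted lexical + word-order score, reduced on negation mismatch."""
    t1, t2 = normalize(s1), normalize(s2)
    score = delta * lexical_sim(t1, t2) + (1 - delta) * order_sim(t1, t2)
    if negation_mismatch(t1, t2):
        score *= negation_penalty
    return score


paraphrase = hybrid_sim("the car is fast", "the automobile is quick")
negated = hybrid_sim("the car is fast", "the automobile is not quick")
swapped = hybrid_sim("the dog chased the cat", "the cat chased the dog")
```

Here the synonym pair scores as a full match, the negated variant is pushed well below it despite sharing almost every word, and the role-swapped pair scores below a full match even though its word sets are identical. These are the cases where overlap-only measures tend to fail in paraphrase detection.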