Query Generation


About this topic

Query Generation is the process of automatically creating search queries or questions from a given set of data or user input, aimed at retrieving relevant information from databases or search engines. It involves natural language processing and understanding to enhance information retrieval efficiency and accuracy.


Key research themes

1. How can automated query expansion and refinement improve information retrieval accuracy and query relevance?

This research area focuses on enhancing initial user queries by automatically adding, modifying, or selecting candidate terms to improve the relevance and coverage of retrieved documents. It matters because many users formulate brief or poorly constructed queries that cause low recall or precision in information retrieval. Automated expansion and refinement techniques harness linguistic resources, semantic similarity models, and query classification to systematically augment queries, balancing recall and precision.

Xu: An Automated Query Expansion and Optimization Tool

by Morgan Gallant

2022

Key finding: Xu demonstrates that automated query expansion using semantic similarity measures from Word2Vec and lexical APIs can improve recall by more than ten percent while managing precision using Boolean operators. The paper...
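As a rough illustration of the expansion idea summarized above, the sketch below adds a capped number of related terms to each query term, joins each group with OR, and joins the groups with AND so that precision is not lost entirely. It is not the Xu tool's implementation: the function name `expand_query`, the `RELATED` table, and the `max_related` cap are hypothetical, and a real system would presumably fill the related-term table from an embedding model or a lexical API.

```python
from typing import Dict, List


def expand_query(terms: List[str], related: Dict[str, List[str]], max_related: int = 2) -> str:
    """Build a Boolean query: OR within each term group, AND across groups."""
    groups = []
    for term in terms:
        # Keep the original term first, then add a capped number of related terms.
        candidates = [term] + related.get(term, [])[:max_related]
        # Drop duplicates while preserving order.
        unique = list(dict.fromkeys(candidates))
        if len(unique) == 1:
            groups.append(unique[0])
        else:
            groups.append("(" + " OR ".join(unique) + ")")
    return " AND ".join(groups)


# Hypothetical related-term table, hard-coded here for illustration only.
RELATED = {"car": ["automobile", "vehicle", "sedan"], "repair": ["fix"]}

print(expand_query(["car", "repair", "cost"], RELATED))
# (car OR automobile OR vehicle) AND (repair OR fix) AND cost
```

Capping the number of related terms per group is one simple way to trade recall against precision; the cap value here is arbitrary.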

An Algorithmic Query Refinement Model based on Query Classification

by Behin Sam

2022, Indian Journal of Science and Technology

Key finding: This work develops and experimentally validates a hybrid refinement method combining ontology and thesaurus resources, guided by an initial classification of query types. Tested on TREC 2014 queries with real-time search...

Query optimization in database systems

by Nisar Ahmed

2022, ACM Computing Surveys (CSUR)

Key finding: This seminal survey outlines logic-based query transformations and cost-based optimization strategies that can be viewed as a foundational parallel to query expansion/refinement in databases. It emphasizes that heuristic...


2. What are the challenges and methods in translating natural language queries into formal SQL queries for relational databases?

This research theme addresses the problem of bridging the gap between natural user language and structured query languages like SQL to enable non-expert users to retrieve data accurately from relational databases. It involves semantic parsing, syntactic and semantic analysis, and the use of grammars and machine learning methods to generate executable SQL commands from free-text inputs. Accurate SQL generation facilitates enhanced accessibility and user-friendly database querying.

Recent Advances in SQL Query Generation: A Survey

by Frosina Stojanovska

2021

Key finding: The survey presents an organized analysis of deep learning architectures (e.g., CNNs, RNNs, pointer networks, reinforcement learning) that have been applied to map natural language questions to SQL queries. It stresses the...

Automatic SQL Query Formation from Natural Language Query

by saparja dey and

2017

Key finding: This paper proposes a semantic grammar-based architecture for converting English queries into SQL commands, targeting users without SQL proficiency. It details the process including morphological, syntactic and semantic...

SQL Query Generator For Natural Language

by Amit Kumar Jaiswal and

2017

Key finding: SQGNL is designed as a database-independent system that utilizes linguistic dependencies and metadata to build sets of possible SELECT and WHERE clauses, generating multiple candidate SQL queries for a given natural language...
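The idea of building sets of possible SELECT and WHERE clauses from schema metadata can be sketched in a few lines. This is only a toy under stated assumptions, not the SQGNL system: the `SCHEMA` dictionary, the `COMPARATORS` cue words, and the function `candidate_queries` are hypothetical, and matching is done on plain tokens rather than on linguistic dependencies.

```python
import re
from typing import Dict, List

# Hypothetical schema metadata: table name -> column names.
SCHEMA: Dict[str, List[str]] = {"employees": ["name", "salary", "city"]}

# Hypothetical cue words mapped to SQL comparison operators.
COMPARATORS = {"above": ">", "below": "<"}


def candidate_queries(question: str, schema: Dict[str, List[str]]) -> List[str]:
    """Return candidate SQL strings for tables and columns named in the question."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    queries: List[str] = []
    for table, columns in schema.items():
        if table not in tokens:
            continue
        # SELECT candidates: every column of the table that the question mentions.
        mentioned = [c for c in columns if c in tokens]
        # WHERE candidates: "<column> above|below <number>" token patterns.
        conditions = []
        for i in range(len(tokens) - 2):
            col, cue, value = tokens[i], tokens[i + 1], tokens[i + 2]
            if col in columns and cue in COMPARATORS and value.isdigit():
                conditions.append(f"{col} {COMPARATORS[cue]} {value}")
        # Combine each SELECT candidate with each WHERE candidate.
        for column in mentioned:
            if not conditions:
                queries.append(f"SELECT {column} FROM {table}")
            for cond in conditions:
                queries.append(f"SELECT {column} FROM {table} WHERE {cond}")
    return queries


for sql in candidate_queries("Show the name of employees with salary above 5000", SCHEMA):
    print(sql)
# SELECT name FROM employees WHERE salary > 5000
# SELECT salary FROM employees WHERE salary > 5000
```

The second candidate is plausible but probably not what the user meant, which is why a system of this kind needs a ranking or selection step after generation.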


3. How can query languages and interfaces be improved to facilitate intuitive, flexible, and efficient database querying for diverse user types?

This theme explores the human factors, linguistic, and logical foundations of query languages and interfaces, focusing on usability for novices and experts alike. It includes research into flexible query languages employing fuzzy logic, exemplar-based interfaces, and hierarchical taxonomies of query languages, aiming to reduce complexity and improve the expressiveness and accessibility of database querying.

No IFs, ANDs, or ORs: A Study of Database Querying

by Louis M Gomez

2017

Key finding: This experimental study compares traditional SQL query languages with a Truth-table Exemplar-Based Interface (TEBI), finding that users of TEBI performed better and with greater resilience to cognitive skill variability. The...

Flexible Query Languages for Relational Databases: An Overview

by Sławomir Zadrożny

2024, Studies in Fuzziness and Soft Computing

Key finding: The paper proposes taxonomies for flexible query languages based on fuzzy set theory, separating approaches for crisp and fuzzy relational databases. It demonstrates that fuzzy linguistic terms can better represent user...
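To make the notion of a fuzzy linguistic term concrete, the following sketch evaluates a condition such as "price is cheap" over crisp rows: each row receives a membership degree in [0, 1] and the result is ranked by that degree instead of being filtered by a hard cut-off. The membership shape, the break points (100 and 200), the `threshold`, and the `HOTELS` rows are all assumptions chosen for illustration and do not come from the paper.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def cheap(price: float, full: float = 100.0, zero: float = 200.0) -> float:
    """Membership degree of 'cheap': 1 up to `full`, falling linearly to 0 at `zero`."""
    if price <= full:
        return 1.0
    if price >= zero:
        return 0.0
    return (zero - price) / (zero - full)


def fuzzy_select(rows: List[Row], threshold: float = 0.5) -> List[Tuple[float, Row]]:
    """Return (degree, row) pairs whose 'cheap' degree meets the threshold, best first."""
    scored = [(cheap(float(r["price"])), r) for r in rows]
    kept = [pair for pair in scored if pair[0] >= threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


# Hypothetical crisp table.
HOTELS: List[Row] = [
    {"name": "A", "price": 90.0},
    {"name": "B", "price": 140.0},
    {"name": "C", "price": 190.0},
]

for degree, row in fuzzy_select(HOTELS):
    print(row["name"], round(degree, 2))
# A 1.0
# B 0.6
```

A crisp predicate such as `price <= 100` would return only hotel A; the fuzzy version also returns B with a lower degree, which is the kind of graded answer flexible query languages aim for.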

Query languages—a taxonomy

by Yannis Vassiliou

2021

Key finding: This work presents a hierarchical taxonomy categorizing query languages by user interaction senses (e.g., visual, verbal) and method-level conceptual models and methodologies (e.g., declarative, imperative, programmatic). The...


All papers in Query Generation

Optimistic Decision Making using an Approximate Graphical Model

by faiza khellaf

2025, International Journal of Artificial Intelligence & Applications

Min-based qualitative possibilistic networks are one of the effective tools for a compact representation of decision problems under uncertainty. The exact approaches for computing decision based on possibilistic networks are limited by...

A Framework for Plagiarism Detection in Arabic Documents

by abobakr bagais

2024, Computer Science & Information Technology ( CS & IT )

We are developing a web-based plagiarism detection system to detect plagiarism in written Arabic documents. This paper describes the proposed framework of our plagiarism detection system. The proposed plagiarism detection framework...

Semantic Extraction of Arabic Multiword Expressions

by Samah Meghawry

2023, Computer Science & Information Technology ( CS & IT )

Considerable interest has been given to the identification and treatment of Multiword Expressions (MWEs). The identification of MWEs affects the quality of results of different tasks heavily used in natural language processing (NLP) such as...

Task 1 of the CLEF eHealth Evaluation Lab 2014

by Jaume Nualart

2023

Discharge summaries serve a variety of aims, ranging from clinical care to legal purposes. They are also important tools in patient empowerment, but a patient's comprehension of the information is often suboptimal. Continuing in the...

Task 1 of the CLEF eHealth Evaluation Lab 2014 Visual-Interactive Search and Exploration of eHealth Data

by Jaume Nualart

2023, HAL (Le Centre pour la Communication Scientifique Directe)

Fig. 1. Summary of the CLEFeHealth2013 tasks and outcomes

However, patients, their next-of-kin, and other laypersons are likely to perceive the readability of discharge summaries as poor, in other words, have difficulties in understanding their content (Fig. 1) [1]. Improving the readability of these summaries can empower patients, providing partial control and mastery over health and care, leading to patients making better health/care decisions, being more independent from health care services, and decreasing the associated costs [2]. Specifically, supportive, patient-friendly, personalized language can help patients have an active role in their health care and make informed decisions. Making the right decisions depends on patients’ access to the right information at the right time; therefore, it is crucial to provide patients with personalized and readable information about their health conditions for their empowerment.

Fig. 6 (continued from previous page). Task 1b inspiration: Example design of an information landscape for overviewing a set of answer documents. In this case, documents are mapped according to relevance and document complexity, with document metadata mapped to color and shape of document marks. See http://clefehealth2014.dcu.ie/ for original images.

Fig. 9. Printable pamphlet design for the optional case 4. This is intended for double-sided A4 printing. When the bottom figure is visible, the right hand side is to be folded first, followed by the left hand side. This results in the page 1 (6) to be on top (bottom). The design is also available at http://goo.gl/4y8PXT (accessed 11 June 2014).

Fig. 13. Workflow of producing electronic and paper-based documents. The design is also available at http://goo.gl/4y8PXT (accessed 11 June 2014).


Multi document summarization based on news components using fuzzy cross-document relations

by Yogan Jaya Kumar

2023, Applied Soft Computing

Online information is growing enormously day by day thanks to the World Wide Web. Search engines often provide users with an abundant collection of articles; in particular, news articles which are retrieved from different news sources...

Distance Learning via Social Media

by Mohammad Derawi

2023, Computer Science & Information Technology (CS & IT)

In this research work, we examine one of the most widely used social networking websites, Facebook, for conducting courses as a replacement for valuable classical electronic learning platforms. At the initial stage of the Internet...

Figure 2. Facebook page for course main menu

By clicking “Create Page” to start creating the page for the course, the teaching staff can specify the course details on the webpage, as in Figure 2. Finally, the teaching staff clicked “Create Page” to create the course page. Then, the lecturer could create photo albums for the lecture notes. The lecturer clicked “Photos” to create a new photo album.

The lecturer then started uploading the lecture note images to the photo album. Since different web browsers support different approaches to uploading images to a photo album (for example, Microsoft Internet Explorer and Google Chrome can make use of a Facebook plugin whereas Firefox uses a Java-based component), the lecturer would experience a different interface when using different web browsers. Once the photo album was created, the lecturer reviewed the images and rearranged the sequence of the images as necessary. Then, the photo album with lecture notes was ready to be accessed by students. For those students who would like to receive notification of course notes publishing, the lecturer could instruct them to add themselves as fans of the course page. The lecturer could either rearrange the images or add new images by clicking the “Organize Photos” or “Add Photo” buttons. For any further updates of the album, students who were fans of the course page would receive new notifications about the changes. The overview of the album is shown in Figure 4 and a screen showing the course content is shown in Figure 5.

Figure 5. Facebook page for course slide presentation

Figure 6. The image for a lecture note slide is shown by an Apple iPhone. The same will also appear in an LG mobile phone

If the presentation file does not involve any transition effects, teaching staff could use the Facebook photo album web page to conduct the lecture/tutorial. The benefit was that if students wanted to raise any question or provide any feedback on a slide, they could post their comments for that slide and the teaching staff would be notified immediately. This feature was especially useful to students who were passive in class. The comments posted were specific to an individual slide, which facilitates the discussion among teaching staff and students. Students could also access the photo album with their own mobile devices, such as mobile phones. Although the devices were small, they support zooming and enabled the students to provide feedback or comments much as on a computer. For example, Figure 6 shows the same lecture note slide as shown by an Apple iPhone and an LG mobile phone respectively.

For slides with text in a smaller typeface, most mobile phones enable users to zoom the images for better readability. When students wanted to leave comments or questions regarding a slide, they used their mobile device to do so. For example, Figure 7 shows the user interfaces of an Apple iPhone and an LG mobile phone, which enable Facebook student users to post comments on a slide. In fact, mobile devices are capable of displaying the slide and enable students to leave comments or questions on a particular slide. Upon receiving comments or questions, all members of the course, including teaching staff, would be notified. As soon as teaching staff received a notification email from Facebook, they could click the embedded link that navigates the web browser to the referred slide, and leave another comment for the same slide as a response. Teaching staff could create quizzes to assess students’ understanding of the lecture. For example, Figure 8 illustrates the use of the Quiz Creator Facebook application by a teaching staff member to create a quiz.

Figure 8. Specify the quiz questions and answers with Quiz Creator


E-Education with Facebook - A Social Network Service

by Mohammad Derawi

2023, Computer Science & Information Technology ( CS & IT )

In this paper, we study the social networking website, Facebook, for conducting courses as a replacement for high-cost classical electronic learning platforms. At the early stage of the Internet community, users of the Internet used email...

A Framework for Plagiarism Detection in Arabic Documents

by Muazzam Siddiqui

2023, Computer Science & Information Technology ( CS & IT )


Query Optimization in Arabic Plagiarism Detection: An Empirical Study

by Muazzam Siddiqui

2023, International Journal of Intelligent Systems and Applications

This article describes ongoing research that aims to develop a plagiarism detection system for Arabic documents. We developed different heuristics to generate effective queries for document retrieval from the Web. The performance...

Automatic building information model query generation

by John Yen

2023, J. Inf. Technol. Constr.

Energy efficient building design and construction calls for extensive collaboration between different subfields of the Architecture, Engineering and Construction (AEC) community. Performing building design and construction engineering...


Semantic Extraction of Arabic Multiword Expressions

by Akram Salah

2023, Computer Science & Information Technology ( CS & IT )


Towards Automatic Building of Document Keywords

by Joaquim Silva

2023, The 23rd International Conference on …

Document keywords are associated to documents as summarized versions of the documents' content. Considering that the number of documents is quickly growing every day, the availability of these keywords is very important. Although,...

Table 2: Precision and Recall Average Values for the Document MWE Descriptors.

QUT IElab at CLEF 2018 Consumer Health Search Task: Knowledge Base Retrieval for Consumer Health Search

by Jimmy Jimmy

2023

In this paper we describe our participation in the CLEF 2018 Consumer Health Search Task, sub task IRTask1. This track aims to evaluate and advance search technologies aimed at supporting consumers to find health advice online. Our...

CLIQUES DETECTION vs MAXIMUM SPANNING TREE FOR TWEET CONTEXTUALIZATION

by lobna hlaoua

2022

Nowadays, social media are very popular among their users. One of the most well-known social networks is Twitter. It is a micro-blog that enables its users to send short messages called tweets. A tweet is a message 140 characters long...

An Efficient Approach for Multi-Sentence Compression

by mohammad ebrahimi

2022

Multi Sentence Compression (MSC) is of great value to many real world applications, such as guided microblog summarization, opinion summarization and newswire summarization. Recently, word graph-based approaches have been proposed and...

Query Optimization in Arabic Plagiarism Detection: An Empirical Study

by Muhammad Imran

2022, International Journal of Intelligent Systems and Applications



Building realistic potential patient queries for medical information retrieval evaluation

by Sanna Salanterä

2022

To evaluate and improve medical information retrieval, benchmarking data sets need to be created. Few benchmarks have been focusing on patients’ information needs. There is a need for additional benchmarks to enable research into...

Computer-based plagiarism detection techniques: A comparative study

by Asst. Prof. Dr. Mohammed S. H. Al-Tamimi

2022

Plagiarism is becoming more of a problem in academics. It's made worse by the ease with which a wide range of resources can be found on the internet, as well as the ease with which they can be copied and pasted. It is academic theft since...

2.1. Textual Plagiarism

Textual plagiarism and source code plagiarism are the two forms of PD methods; as shown in Figure [?], different types of PD approach [28]

Figure 2: Plagiarism categories

PD can be classified based on the language of the texts being processed as Mono-Lingual if the source and suspect documents use the same language, or Cross-Lingual if the languages are diverse. Automatic PD uses a reference corpus that compares a suspect text to a collection of papers to identify the source of the plagiarized pieces. The source and suspect documents may be written in the same language (Mono-Lingual), such as English-English, or in different languages (Cross-Lingual); the plagiarism categories are shown in Figure 2.

Plagiarism detection techniques are essential for identifying instances of plagiarism; the stolen material must be distinguished from the original by a plagiarism detection function. This procedure can occasionally validate the quantity of material that is plagiarized [18]. PD is the method of separating the document’s characteristics, assessing its content, identifying potentially plagiarized sections, and bringing similar remaining documents to light if they are accessible. This method can improve PD performance by eliminating the selection of source texts and incorporating semantic relationships between words and their structural composition [10]. Sentences with a high degree of resemblance to suspicious text sentences but distinct meaning

Plagiarism detection system is shown in Figure 3 [4]

Comparison of Several Word embedding Sources for Medical Information Retrieval

by Julie BU DAHER

2022

Abstract. This paper describes the participation of the MRIM team in Task 3: Patient-Centered Information Retrieval-IRTask 1: Ad-hoc search of the CLEF eHealth Evaluation lab 2016. The aim of this task is to evaluate the effectiveness of...

International Journal of Computer Sciences and Engineering Open Access

by Dr.S.Santhana Megala Asst Prof, SCS

2022

An Automatic Summary generation process creates a shortened version of the text using a Digital programming Technology, with the aim of holding the most advanced important points of the original text. In a Common Law system, previous...

CLIQUES DETECTION vs MAXIMUM SPANNING TREE FOR TWEET CONTEXTUALIZATION

by Amira Dhokar

2021


Discovering Clusters of Plagiarism in Students’ Source Codes

by Lefteris Moussiades

2021, Journal of Engineering Science and Technology Review

Plagiarism in students' source codes constitutes an important drawback for the educational process. In addition, plagiarism detection in source codes is a time-consuming and tiresome task. Therefore, many approaches for plagiarism detection...

The IR Task at the CLEF eHealth Evaluation Lab 2016: User-centred Health Information Retrieval

by Julie BU DAHER

2021

This paper details the collection, systems and evaluation methods used in the IR Task of the CLEF 2016 eHealth Evaluation Lab. This task investigates the effectiveness of web search engines in providing access to medical information for...

Task 1 of the CLEF eHealth Evaluation Lab 2014 Visual-Interactive Search and Exploration of eHealth Data

by Jaume Nualart

2021


A Survey of Cross-Lingual Plagiarism Detection using Natural Language Processing

by IJRASET Publication

2020, International Journal for Research in Applied Science and Engineering Technology IJRASET

Plagiarism detection is gaining importance due to requirements for integrity in Research works especially when it comes to Cross-lingual plagiarism. In this paper, we have researched a new approach for Cross-Lingual sentence level...

Algorithm to Identify the Connection between Sentences

by Dr. Praveen Sankarasubramanian

2020, INTERNATIONAL JOURNAL OF INFORMATION AND COMPUTING SCIENCE

Many applications require the affiliation of sentences (which includes text summarization, answering questions, producing natural language, analyzing natural language, and text clustering). The similarity of terms may be improved using...

Document overlap detection system for distributed digital libraries

by Heinz Schmidt

2020, Proceedings of the fifth ACM conference on Digital libraries - DL '00

In this paper we introduce the MatchDetectReveal(MDR) system, which is capable of identifying overlapping and plagiarised documents. Each component of the system is briefly described. The matching-engine component uses a modified suffix...

Classification of Legal Judgement Summary using Conditional Random Field Algorithm

by santhana megala

2019, IJCSE

In previous research, probabilistic and rule-based techniques were used for text summarization, but legal judgment summarization is a tedious process, and it is not as easy to find the important sentences as in other documents. A single word that occurs only once in the judgment may be important. Hence, enhanced methods are needed to obtain a good judgment summary.

Figure 1: Overall System Architecture of a Legal Judgement Summarization System

Figure 2: Architecture of the Text Summarization System based on Fuzzy Logic

In the defuzzification step, the output membership function is divided into three membership functions, namely "Unimportant", "Average" and "Important", which convert the result of the inference engine into a crisp output to obtain a final sentence score for each sentence.

The values a, b and c were the standard values of Low, Medium and High respectively, and the values l, m and n were the calculated values of Low, Medium and High respectively. Defining IF-THEN rules is important in the inference engine. Sample IF-THEN rules for the inference engine based on the feature extraction measures are mentioned below.

The most important benefit of CRFs over Hidden Markov Models (HMMs) is their conditional nature, resulting in the relaxation of the independence assumptions required by HMMs to ensure tractable inference. The architecture of the CRF approach to generate a summary for a given legal judgment document is depicted in Figure 4.

Let $\mathrm{Ptr}(k, t)$ be the function that returns the value of $x$ used to compute $V_{t,k}$ if $t > 1$, or $k$ if $t = 1$. Then:

Figure 5: Keep tracking the likely sequence states using VA

Figure 6: Architecture to generate Legal Judgment summary using LDA

After pre-processing, all the sentences present in the document were sent to LDA as a bag of words, and the outcome of the process is a set of different topics based on the probabilistic model. A set of topics for the given corpus is thus derived using the LDA topic model. Consider each judgment from the corpus and find the sentences present in the document using the Sentence Boundary method.

Figure 7: Graphical representations of F-Measure value for CRF & LDA

Figure 8: ROC Curves for Fuzzy Logic & Online Summarizer

Table 5: Comparison between online summarizer and fuzzy logic based on TP & FP Rate

ROC analysis is related in a direct and natural way to cost/benefit analysis of diagnostic decision making. The ROC curve for our system grows along the top-left border, which shows good accuracy.

Algorithm based on the Rhetorical Roles present in the Legal Judgment document.

Figure 9: Sample Output for Judgement Summarization using Fuzzy Logic

CRFs make a first-order Markov independence assumption with binary feature functions to link the output nodes of the graphical model in a linear chain by edges and thus can be understood as conditionally-trained finite state machines (FSMs) which are suitable for segmentation and sentence labeling [10].

The Viterbi algorithm, used to find the shortest route through a graph, is shown below:
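The original algorithm listing did not survive on this page, so what follows is a generic textbook sketch of Viterbi decoding rather than the authors' version. It uses HMM-style start, transition and emission probabilities (a CRF would use feature-based scores instead), keeps the table V[t][k] and the back-pointers ptr[t][k] in the same roles as the V and Ptr notation above, and runs on a hypothetical two-label model whose labels, observation names and probabilities are invented for illustration.

```python
from typing import Dict, List, Tuple


def viterbi(
    obs: List[str],
    states: List[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> Tuple[List[str], float]:
    """Return the most likely state sequence for `obs` and its probability."""
    # V[t][k]: probability of the best state sequence that ends in state k at step t.
    V = [{k: start_p[k] * emit_p[k][obs[0]] for k in states}]
    # ptr[t][k]: the predecessor state x that was used to compute V[t][k].
    ptr: List[Dict[str, str]] = [{}]
    for t in range(1, len(obs)):
        V.append({})
        ptr.append({})
        for k in states:
            best_prev = max(states, key=lambda x: V[t - 1][x] * trans_p[x][k])
            V[t][k] = V[t - 1][best_prev] * trans_p[best_prev][k] * emit_p[k][obs[t]]
            ptr[t][k] = best_prev
    # Pick the best final state, then follow the back-pointers to the start.
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, ptr[t][path[0]])
    return path, V[-1][last]


# Hypothetical model: two sentence labels and three observed sentence cues.
STATES = ["Facts", "Ruling"]
START = {"Facts": 0.6, "Ruling": 0.4}
TRANS = {"Facts": {"Facts": 0.7, "Ruling": 0.3}, "Ruling": {"Facts": 0.4, "Ruling": 0.6}}
EMIT = {
    "Facts": {"narrative": 0.5, "citation": 0.4, "verdict_cue": 0.1},
    "Ruling": {"narrative": 0.1, "citation": 0.3, "verdict_cue": 0.6},
}

path, prob = viterbi(["narrative", "citation", "verdict_cue"], STATES, START, TRANS, EMIT)
print(path, round(prob, 5))
# ['Facts', 'Facts', 'Ruling'] 0.01512
```

For long sequences the products underflow, so practical implementations usually add log-probabilities instead of multiplying; that change is omitted here to keep the sketch close to the notation.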

Algorithm: Sentence Score Generation

The algorithm to find the sentence score for each sentence from the given judgment is shown below. Consider all the sentences $S_r$, $r \in \{1, \dots, R\}$, in the documents and all the topics $T_j$, $j \in \{1, \dots, K\}$, and calculate the probability of the sentence $S_r$ given the topic $T_j$, i.e. $P(S_r \mid T_j)$. This is the probability that the sentence $S_r$ belongs to or represents the topic $T_j$. Let the words of the sentence $S_r$ be $\{W_1, W_2, \dots, W_q\}$. The algorithm to find the sentence score for each sentence based on each topic for the entire corpus is given below:

A rich set of features was included in this paper to identify the rhetorical roles present in the legal judgment.

Table 3: Precision, Recall & F-Measure Value for the Seven Segments using CRF & LDA

Table 4: T-test table for the null hypothesis H0 based on the average F-measures obtained for CRF & LDA.

In this paper, the average F-measures of the sample legal documents were taken as the performance measures for the statistical test. A null hypothesis H0 was set stating that there is no difference between the results generated by the Conditional Random Field and Latent Dirichlet Allocation. The alternative hypothesis H1 states that there is a difference between the results generated by the two methods. The null hypothesis was rejected because the calculated t value is greater than the critical t value, i.e., 7.039 > 2.306. The p-value obtained for the calculated t score is 0.00054, which is less than 0.01, so the result is significant at p < 0.01. Based on the paired t-test results in Table 4, the CRF method provides better results than the LDA method.
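The decision rule described here (reject H0 when the calculated t exceeds the critical t) can be sketched as below. The F-measure lists are hypothetical placeholders, not the values from Table 3 or Table 4; only the critical value 2.306 is taken from the text, and it corresponds to a two-tailed test at the 0.05 level with 8 degrees of freedom, which is why the made-up sample has nine pairs. The function names are likewise illustrative.

```python
import math
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Paired t statistic: mean of the differences divided by its standard error."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)


def reject_null(t_value: float, t_critical: float) -> bool:
    """Two-tailed decision rule: reject H0 when |t| exceeds the critical value."""
    return abs(t_value) > t_critical


# Hypothetical average F-measures for nine documents (placeholders, not the paper's data).
CRF_F = [0.80, 0.90, 0.80, 0.90, 0.80, 0.90, 0.80, 0.90, 0.85]
LDA_F = [0.70] * 9

t = paired_t_statistic(CRF_F, LDA_F)
print(round(t, 3), reject_null(t, t_critical=2.306))
```

With these placeholder numbers the statistic comes out at about 9, so H0 would be rejected, mirroring the direction of the reported result without reproducing it.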

Figure 10: Sample Output for the Structured Summary using Classification Technique

Data Mining for Prediction of Human Performance Capability in the Software Industry

by Gaurav Thakur

2017, International Journal of Data Mining & Knowledge Management Process

The recruitment of new personnel is one of the most essential business processes affecting the quality of human capital within any company. It is essential for companies to ensure the recruitment of the right talent to maintain...

A Method For Arabic Documents Plagiarism Detection

by Yahya Ali and

2017

Plagiarism has become an infamous problem in the global academic community. Detecting plagiarism in Arabic documents is a particularly challenging task due to the complexity of the structure of the language. This paper introduces a...

A Framework for Plagiarism Detection in Arabic Documents

by Imtiaz Khan

2017, Computer Science & Information Technology ( CS & IT )


Query Optimization in Arabic Plagiarism Detection: An Empirical Study

by Imtiaz Khan

2017, International Journal of Intelligent Systems and Applications


A New Hybrid Metric for Verifying Parallel Corpora of Arabic English

by Wei Liu and

2016, Computer Science & Information Technology ( CS & IT )

This paper discusses a new metric that has been applied to verify the quality in translation between sentence pairs in parallel corpora of Arabic-English. This metric combines two techniques, one based on sentence length and the other...

Universal Mobile Information Retrieval

by Sebastião Pais

2016

The shift in human-computer interaction from desktop computing to mobile interaction strongly influences the need for newly designed interfaces. In this paper, we address the issue of searching for information on mobile devices, an area also...

PDLK: Plagiarism detection using linguistic knowledge

by Asad Abdi, Ph.D.

2015

Plagiarism is described as the reuse of someone else’s previous ideas, work or even words without sufficient attribution to the source. This paper presents a method to detect external plagiarism using the integration of semantic relations between words and their syntactic composit...

Optimistic Decision Making using an Approximate Graphical Model

by Boutouhami Khaoula

2015, International Journal of Artificial Intelligence & Applications


REAL TIME CLUSTERING OF TIME SERIES USING TRIANGULAR POTENTIALS

by Computer Science & Information Technology (CS & IT) Computer Science Conference Proceedings (CSCP)

2015

Motivated by the problem of computing investment portfolio weightings we investigate various methods of clustering as alternatives to traditional mean-variance approaches. Such methods can have significant benefits from a practical point... more

AN APPROXIMATE POSSIBILISTIC GRAPHICAL MODEL FOR COMPUTING OPTIMISTIC QUALITATIVE DECISION

by Computer Science & Information Technology (CS & IT) Computer Science Conference Proceedings (CSCP) and

2015

KNOWLEDGE MANAGEMENT IN HIGHER EDUCATION : APPLICABILITY OF LKMC MODEL IN SAUDI UNIVERSITIES

by Computer Science & Information Technology (CS & IT) Computer Science Conference Proceedings (CSCP)

2015

This paper stresses the need for Knowledge Management (KM) in the higher education institutions of Saudi Arabia. The paper is based on a literature review and the personal experience of the author in the education sector. The paper... more

A NEW HYBRID METRIC FOR VERIFYING PARALLEL CORPORA OF ARABIC-ENGLISH

by Computer Science & Information Technology (CS & IT) Computer Science Conference Proceedings (CSCP)

2015

Figure 1. Tendencies of the classification of satisfactory translations and unsatisfactory translations for test Corpus B with different threshold values. is set as low as 1.25 (meaning most sentence pairs will be rejected). The only calculation that results in an average accuracy of 100% for all sentence pairs (both satisfactory and unsatisfactory) occurs when both SLR and CR are combined together with a threshold of 2.5. Figure 1 shows the tendencies of the classification of the satisfactory translations and unsatisfactory translations for test Corpus B using different threshold values. A further experiment was conducted to investigate whether different threshold values are more effective when using the combined SLR&CR technique. Table 9 displays the accuracy results matrix of the experiments on the overall accuracy averages on the same 10000 satisfactory translations and 2000 unsatisfactory translations in test Corpus B used in the previous experiment. In the table, the SLR threshold value is shown across the top row, and the CR threshold value is shown down the left column, both ranging from 1.25 up to 3.50. The table shows that 100%

Figure 2. Sentence length correlation for test Corpus A for sentence pairs classified as unsatisfactory.

Figure 3. Code length correlation for test Corpus A for sentence pairs classified as unsatisfactory.

Figure 4. Sentence length correlation for test Corpus A for translations classified as satisfactory.

Figure 5. Code length correlation for test Corpus A for translations classified as satisfactory.

Figure 6. Flow chart of how the new hybrid sentence matching metric based on both compression code length and sentence length was applied to test Corpus A.

Table 1. A list of differences between the Arabic and English languages. The use of parallel Arabic-English corpora to train statistical MT models provides an effective way for building MT systems. However, Arabic-English parallel texts of high quality are still very limited and are not available in satisfactory quantities, therefore most translations are performed manually, a time-consuming and often error-filled process. Limitations of existing parallel corpora include incomplete data and untagged entries, with only limited text genres being available (such as news stories). In addition, many of the better quality corpora are not available for public use, with fees in the thousands of dollars. For example, a list of corpora that were available from the Linguistic Data Consortium (LDC) in 2013 at the beginning of our research project is shown in Table 2 [12]. These costs are often unaffordable for most students, and also for many researchers or small research groups.

Table 2. Parallel Arabic-English Corpora as provided by the LDC in 2013 [12].

3.2. Code Length Ratio Distance Metric for Matching Sentences Table 3. PPMD order 3 model after processing the text string “ 4»ubulliw” The term code length refers to the size (in bytes) of the compressed output file produced by the PPM compression algorithm. When using PPM to compress Arabic or English text, the code length is a measure of the cross-entropy of the text, which is the average size (in bytes) per character for the compressed output string. Theoretically, the cross-entropy is estimated as follows:
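The formula itself did not survive in this excerpt. As a hedged reconstruction, the standard cross-entropy estimate for a text $S = x_1 x_2 \dots x_n$ under a compression model with probability function $p$ is sketched below; the symbols $S$, $x_i$, $n$ and $p$ are assumptions introduced here, and the paper's own notation may differ.

```latex
H(S) \;=\; -\frac{1}{n}\,\log_2 p(S)
     \;=\; -\frac{1}{n}\sum_{i=1}^{n} \log_2 p\!\left(x_i \mid x_1 \dots x_{i-1}\right)
```

For an order 3 PPM model, the conditioning context is limited to the three preceding characters, $x_{i-3} x_{i-2} x_{i-1}$. In this form $H(S)$ is measured in bits per character, so the total code length is approximately $n \cdot H(S)$ bits, or $n \cdot H(S) / 8$ bytes.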

Table 4. Character and word counts for test Corpus A.

4.2. Compression Experiments

Table 5. Sample sentence pairs that were used in the initial compression experiments. The results of compressing these sentences using the PPM compression scheme are shown in Table 6. The table lists the number of bytes that various variants of PPM produced as compressed output. For example, for sentence pair with id 1 (i.e. the first in Table 5), the WOT variant required 69 bytes to compress the Arabic sentence, compared to 69 bytes to compress the English sentence. In contrast, the sentence lengths are very different: the Arabic sentence is 59 characters (bytes) long compared to 95 characters for the English sentence. In a preliminary experiment, 10 sample sentence pairs in Arabic and English were randomly chosen from Corpus A. The 10 sample sentence pairs that we used are shown in Table 5.

Table 6. Compression results of the sample sentences. The PPMD5 compression code length results list the size in bytes of the compressed output produced by various variants of the PPMD5 compressor. From the table, we can see there is a clear mis-match as expected between the Arabic and English sentence lengths. This provides clear evidence that metrics based on techniques well founded in information theory (as is the case for compression code length based metrics) have merit since they lead to better correlation.

Table 7. Percentage of Arabic sentence lengths or compression code lengths greater than their English sentence counterparts for the test Corpus A. These results provide reassuring evidence that the compression methods we have adopted produce the desired (and necessary) correlated data for the subsequent experiments we conducted that are described in the next section.

Table 9. The accuracy results matrix for test Corpus B using threshold values of SLR and CR from 1.25 to 3.50. Another experiment was devised to determine how much of the larger test Corpus A would be classified as satisfactory or unsatisfactory using various CR threshold values (from 1.25 to 3.50) when the SLR threshold value was set at 2.5. The results of this experiment are shown in Table 10. The table shows the number classified in each category (in the columns labelled “Amount”) and the corresponding percentages. For example, a threshold value of 2.50 for both SLR and CR results in 8.18% of test Corpus A being labelled unsatisfactory (and therefore candidates for being removed from the corpus).
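The combined SLR and CR thresholding described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the ratio is assumed to be the larger value divided by the smaller one, a pair is assumed to be satisfactory only when both ratios are at or below their thresholds, the function names are hypothetical, and `zlib` stands in for the PPM compressor, which is not part of the Python standard library.

```python
import zlib


def length_ratio(a: int, b: int) -> float:
    """Ratio of the larger to the smaller value (assumed definition, always >= 1)."""
    lo, hi = sorted((a, b))
    if lo == 0:
        return float("inf")  # an empty side can never match a non-empty one
    return hi / lo


def code_length(text: str) -> int:
    """Size in bytes of the compressed text (zlib used as a stand-in for PPM)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def is_satisfactory(arabic: str, english: str,
                    slr_threshold: float = 2.5,
                    cr_threshold: float = 2.5) -> bool:
    """Classify a sentence pair using both ratios (assumed combination rule)."""
    # Sentence length ratio (SLR), measured on byte lengths.
    slr = length_ratio(len(arabic.encode("utf-8")), len(english.encode("utf-8")))
    # Code length ratio (CR), measured on compressed sizes.
    cr = length_ratio(code_length(arabic), code_length(english))
    return slr <= slr_threshold and cr <= cr_threshold
```

With the lengths quoted for the first sample pair (59 and 95 characters), `length_ratio(59, 95)` is about 1.61, which is below a threshold of 2.5. Note that a general-purpose compressor adds fixed overhead on very short inputs, so the code lengths it reports are only a rough proxy for what a PPM model would produce.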

Table 10: Percentages of satisfactory and unsatisfactory translations for test Corpus A when the SLR threshold is set at 2.5. Figures 2, 3, 4 and 5 show correlations for the sentence length and code length metrics for test Corpus A. Figures 2 and 3 illustrate the sentence lengths and code lengths of Arabic and English sentences classified as unsatisfactory for the test Corpus A and show an obvious split in the plots due to 1:2 and 2:1 type mismatches.

E-EDUCATION WITH FACEBOOK – A SOCIAL NETWORK SERVICE

by Computer Science & Information Technology (CS & IT) Computer Science Conference Proceedings (CSCP) and

2015

ANALYSIS OF COMPUTATIONAL COMPLEXITY FOR HT-BASED FINGERPRINT ALIGNMENT ALGORITHMS ON JAVA CARD ENVIRONMENT

by Computer Science & Information Technology (CS & IT) Computer Science Conference Proceedings (CSCP)

2015

In this paper, implementations of three Hough Transform based fingerprint alignment algorithms are analyzed with respect to time complexity on Java Card environment. Three algorithms are: Local Match Based Approach (LMBA), Discretized... more

A FRAMEWORK FOR PLAGIARISM DETECTION IN ARABIC DOCUMENTS

by Computer Science & Information Technology (CS & IT) Computer Science Conference Proceedings (CSCP)

2015

Survey on Clustering Algorithm for Sentence Level Text

by IJCSMC Journal

2014

Clustering is an extensively studied data mining problem in the text domains. The difficulty finds numerous applications in customer segmentation, classification, collaborative filtering, visualization, document organization, and... more

Plagiarism Detection In Arabic Scripts Using Fuzzy Information Retrieval

by Salha Alzahrani

2013

Abstract—The nature of Arabic language structure exposes the need for fuzzy or vague concept to reveal dishonest practices in Arabic documents. In this paper, we present a statement-based plagiarism detection approach in Arabic scripts... more

Figure 1. The effect of removing Arabic stopwords on the number of words and size for the first 10 documents

Figure 2. Statement Distribution in our Corpus Collection they are semantically different based on the degree of similarity among words in both. Similarity between two statements has two cases: restructuring (i.e. changing the structure such as from active to passive) and rewording (replacing words with synonyms and antonyms). Fuzzy-set IR model [6, 17] can be used to judge similarity in both cases. This section describes the methodology used to adopt Arabic fuzzy-set IR model as in [6]. Table I exemplifies a pair of similar but restructured Arabic statements.

and antonyms. Our Arabic fuzzy-set IR detected fewer than half of them. That is, the problem is not with restructuring but with rewording. This is because our term-to-term correlation matrix does not have enough pairs of terms with their synonyms and antonyms. Besides, we found that 25% of plagiarized statements detected in case (v) were mostly restructured but not reworded statements. Precision and recall in Figure 3 were calculated as illustrated in (5) and (6). As can be seen, the first three cases were optimal or near optimal since most of the statements were either duplicates or semantically the same but with different structure. In contrast, the effectiveness of our retrieval model measured in precision and recall in the last two cases was not encouraging, which gives a remarkable point for further enhancement of our model.

TABLE I. EXAMPLE SIMILAR ARABIC STATEMENTS To start with, we generate different pairs of Arabic terms from both CDocs and QDocs. Samples of pairs are listed in Table II. Note that we use “term” here to refer to non-stop, stemmed words.

TABLE II. SAMPLE OF ARABIC TERMS PAIRS a. English translation has been provided here to make Arabic words clearer to the reader but it does not interfere with programming

TABLE II. TERM-TO-TERM CORRELATION FACTOR Thirdly, we construct a term-to-term correlation matrix that consists of term pairs and their correlation factors as seen in Table IV.
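A term-to-term correlation matrix and a fuzzy statement similarity can be sketched as follows. The excerpt does not say how the correlation factors were computed, so this sketch assumes the common co-occurrence formulation of the fuzzy-set IR model, where the correlation of two terms is the number of statements containing both divided by the number containing either. All function names are hypothetical, and the inputs are assumed to be lists of non-stop, stemmed terms.

```python
from collections import defaultdict
from itertools import combinations


def build_correlation(statements):
    """Return corr(a, b) built from co-occurrence counts (assumed formulation)."""
    freq = defaultdict(int)      # number of statements containing each term
    co_freq = defaultdict(int)   # number of statements containing both terms
    for terms in statements:
        unique = set(terms)
        for t in unique:
            freq[t] += 1
        for a, b in combinations(sorted(unique), 2):
            co_freq[(a, b)] += 1

    def corr(a, b):
        if a == b:
            return 1.0
        key = (a, b) if a < b else (b, a)
        n_ab = co_freq.get(key, 0)
        denom = freq.get(a, 0) + freq.get(b, 0) - n_ab
        return n_ab / denom if denom else 0.0

    return corr


def membership(term, statement_terms, corr):
    """Degree to which `term` belongs to a statement: 1 - prod(1 - corr)."""
    product = 1.0
    for other in set(statement_terms):
        product *= 1.0 - corr(term, other)
    return 1.0 - product


def similarity(query_terms, candidate_terms, corr):
    """Average membership of the query's terms in the candidate statement."""
    query = set(query_terms)
    if not query:
        return 0.0
    return sum(membership(t, candidate_terms, corr) for t in query) / len(query)
```

Under this formulation a reworded statement scores well only if the matrix links each replaced word to its substitute, so synonym pairs that never co-occur in the collection contribute nothing.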

Report on the ECIR 2011 workshop on information retrieval over query sessions

by Benjamin Carterette

2013

Abstract Research in Information Retrieval has traditionally focused on serving the best results for a single query. Real users, however, often begin an interaction with a search engine with a sufficiently under-specified information need... more
