The purpose of this project is to develop an English-language semantic search engine for searching news articles imported from the BBC.
For a short introduction to search engines, semantic search engines, and the structure of this project, please read the introductory report, here.
Project's Tasks
To build a semantic search engine, language processing is applied to both the documents to be searched and the user's query. In addition,
some semantic processing must be carried out either on the documents, once and offline, or on the query, online.
Since the query is much smaller than the document collection, the semantic analysis is applied to the query.
The stages of developing and using this semantic search engine can be summarized under three headings: Documents processing,
Query processing, and Searching and showing the results.
Documents processing
The first step is to read the data and parse it.
The parsed documents are then analyzed using Stanford CoreNLP: both the title and the content of each document are split into sentences,
then each sentence is split into tokens and the stop words are removed.
Finally, the lemma of each token is found.
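The analysis steps above can be sketched in plain Java. This is only an illustration under stated assumptions and not the project's CoreNLP code: the class `PreprocessSketch`, its stop-word list, and its toy lemma rule (a few irregular forms plus stripping a plural "s") are hypothetical stand-ins for what CoreNLP does with real linguistic models.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the CoreNLP pipeline; all names here are illustrative.
class PreprocessSketch {
    // Tiny stop-word list, assumed for this example only.
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "is", "are", "of", "in", "on", "to", "and", "by");

    // A few irregular forms; a real lemmatizer relies on a full dictionary.
    static final Map<String, String> IRREGULAR = Map.of("women", "woman", "men", "man");

    // Split text into sentences with the JDK's locale-aware BreakIterator.
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) out.add(sentence);
        }
        return out;
    }

    // Toy lemma rule: irregular lookup first, otherwise strip a plural "s".
    static String lemma(String token) {
        String irregular = IRREGULAR.get(token);
        if (irregular != null) return irregular;
        boolean plural = token.length() > 3 && token.endsWith("s") && !token.endsWith("ss");
        return plural ? token.substring(0, token.length() - 1) : token;
    }

    // Sentences -> lowercase tokens -> stop-word removal -> lemmas.
    static List<String> process(String text) {
        List<String> lemmas = new ArrayList<>();
        for (String sentence : sentences(text)) {
            for (String token : sentence.toLowerCase(Locale.ENGLISH).split("[^a-z]+")) {
                // Skip empty fragments, single letters (e.g. the "s" of "world's"), and stop words.
                if (token.length() < 2 || STOP_WORDS.contains(token)) continue;
                lemmas.add(lemma(token));
            }
        }
        return lemmas;
    }

    public static void main(String[] args) {
        // prints [jyoti, amge, world, shortest, woman, woman, runner]
        System.out.println(process("Jyoti Amge is the world's shortest woman. Women are runners."));
    }
}
```

Note how "woman" and "Women" end up as the same lemma, which is what later allows a query for "women" to match a document that only mentions "woman".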
The last step in this phase is to index and save the processed documents so that they can be searched in the later steps.
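To make the idea of indexing concrete, here is a minimal sketch of an inverted index that maps each lemma to the documents containing it and writes the result to a file. The project itself uses Lucene for this; the class `IndexSketch` and its one-line-per-term file format are assumptions made for illustration and do not reflect Lucene's API or on-disk format.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-in for Lucene indexing; names and file format are illustrative.
class IndexSketch {
    // Build lemma -> sorted document ids from already lemmatized documents.
    // A document's id is its position in the input list.
    static Map<String, TreeSet<Integer>> build(List<List<String>> lemmatizedDocs) {
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        for (int id = 0; id < lemmatizedDocs.size(); id++) {
            for (String lemma : lemmatizedDocs.get(id)) {
                index.computeIfAbsent(lemma, k -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    // Save one "lemma<TAB>id,id,..." line per term so the index can be reloaded later.
    static void save(Map<String, TreeSet<Integer>> index, Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, TreeSet<Integer>> entry : index.entrySet()) {
            StringBuilder ids = new StringBuilder();
            for (int id : entry.getValue()) {
                if (ids.length() > 0) ids.append(',');
                ids.append(id);
            }
            lines.add(entry.getKey() + "\t" + ids);
        }
        Files.write(file, lines);
    }

    public static void main(String[] args) throws IOException {
        Map<String, TreeSet<Integer>> index = build(List.of(
                List.of("world", "shortest", "woman"),
                List.of("woman", "arrest", "man")));
        save(index, Files.createTempFile("index", ".tsv"));
        System.out.println(index);
    }
}
```

Because indexing happens once and offline, the cost of analyzing every document is paid only at this stage, not at query time.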
Query processing
The first step is to read the query entered by the user.
Then the same linguistic processing that was applied to the documents is applied to the query.
Moreover, using CoreNLP, we determine the part of speech of each token and look up its synonyms.
Finally, the query to be searched for is rebuilt from the lemmas of the tokens in the original query
together with the lemmas of their synonyms.
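The query rebuilding step can be sketched as follows. This is a simplified illustration, not the project's code: the class `QueryExpansionSketch` and its hand-written `SYNONYMS` table are hypothetical and stand in for a WordNet lookup, and the part-of-speech filtering described above is omitted for brevity.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of query expansion; the synonym table is invented for the example.
class QueryExpansionSketch {
    // Stand-in for a WordNet lookup, keyed by lemma.
    static final Map<String, List<String>> SYNONYMS = Map.of(
            "parting", List.of("leave", "departure"),
            "woman", List.of("female"));

    // Keep the original lemmas first, then append the lemmas of their synonyms.
    // A LinkedHashSet preserves that order and drops duplicates.
    static Set<String> expand(List<String> queryLemmas) {
        Set<String> expanded = new LinkedHashSet<>(queryLemmas);
        for (String lemma : queryLemmas) {
            expanded.addAll(SYNONYMS.getOrDefault(lemma, List.of()));
        }
        return expanded;
    }

    public static void main(String[] args) {
        // prints [parting, leave, departure]
        System.out.println(expand(List.of("parting")));
    }
}
```

Expanding the query rather than the documents keeps the index small, at the price of a slightly larger query at search time.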
Searching and showing the results
After the user's query has been read and processed, the indexed documents are searched,
and the matching original documents are then returned as the results to be shown.
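A minimal sketch of this lookup is shown below, assuming an index that maps each lemma to a set of document ids and a list holding the original documents. The class `SearchSketch` and its ranking rule (number of matching query terms, ties broken by document id) are assumptions made for illustration; Lucene's own scoring is more elaborate.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the search step; not Lucene's API or scoring.
class SearchSketch {
    static List<String> search(Collection<String> queryTerms,
                               Map<String, Set<Integer>> index,
                               List<String> originalDocs) {
        // Count how many query terms each document matches.
        Map<Integer, Integer> hits = new HashMap<>();
        for (String term : queryTerms) {
            for (int id : index.getOrDefault(term, Set.of())) {
                hits.merge(id, 1, Integer::sum);
            }
        }
        // Rank by match count (descending), then by document id (ascending).
        List<Integer> ids = new ArrayList<>(hits.keySet());
        ids.sort((a, b) -> {
            int byHits = hits.get(b) - hits.get(a);
            return byHits != 0 ? byHits : a - b;
        });
        // Return the original, unprocessed documents for display.
        List<String> results = new ArrayList<>();
        for (int id : ids) results.add(originalDocs.get(id));
        return results;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "Mass burial for Philippines dead.",
                "VIDEO: Loneliness of the long distance runner.");
        // Made-up index entries for the example.
        Map<String, Set<Integer>> index = Map.of("dead", Set.of(0), "leave", Set.of(1));
        System.out.println(search(List.of("parting", "leave", "dead"), index, docs));
    }
}
```

The search runs over lemmas, but the user is shown the original text, which is why the document list is kept alongside the index.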
Results examples
The following table shows some possible user queries and some of the news items returned for them.
Note that each search result contains, in either its title or its content, the query word or one of its synonyms.
| Query | Titles of Returned Documents | Descriptions of Returned Documents |
| --- | --- | --- |
| women | Indian is world's shortest woman | Jyoti Amge, ... world's shortest woman by Guinness World Records. |
| | Woman set alight in New York fit | A man is arrested after a 73-year-old woman is ... |
| | other results. | |
| Parting | Mass burial for Philippines dead. | A mass burial ..., which left at least 650 dead on the southern island of Mindanao. |
| | VIDEO: Loneliness of the long distance runner. | Olympic hopeful Nader el Masri ... leave... |
| | other results. | |
Skills covered in this project
- Skills
Natural Language Processing, Information Retrieval.
- Programming language
Java.
- Tools, Libraries, and Software
Lucene, Stanford CoreNLP, and WordNet.