CS2022 – Data Structures and Algorithms
B. Sc. Engineering Semester 2
Programming Project
1 Introduction
1.1 The Idea
E-books are becoming more and more popular due to various reasons. In this
project, you are required to create and maintain a keyword index for an e-book. The
keywords are extracted from the book and are made available to you in the
following format.
Contents of the File “Input.txt”
Keyword List Start
Bubble 3
Merge 1
Heapsort 1
Heap 1
Sort 2
Greedy 2
Complexity 2
game 3
sort 1
Keyword List End
Queries Start
First heap
List sort
List Bubble
Keywords 1
Keywords 2
Queries End
1.2 Output of the Program
1
1 2
3
Heap Heapsort Merge sort
Complexity Greedy Sort
In the above list, each line contains a string and a number separated by a “tab”
character. The string is the keyword and the number represents a page number the
keyword appeared. You are expected to read the input file and store the keywords
and the pages they appear. The system should provide the following functionality.
1. Given a keyword, output the first page the keyword appeared.
2. Given a keyword, output the sorted list of pages (in ascending order) a given
keyword appeared.
3. Given a page number, output the sorted list of keywords (in ascending order)
that appeared in the page.
The operations will be specified in the following format in the input file.
1. First<whitespace><Keyword>
a. First Keyword7
2. List<whitespace><Keyword>
a. List Keyword1
3. Keywords<whitespace><page_number>
a. Keywords 2
The output of the operations should be displayed in the standard output in the
following manner.
The result of each query should be listed in a single line.
For the second operation, the pages should be separated by a single
whitespace.
For the third operation, the keywords should be separated by a single
whitespace.
If the specified keyword is not present in the input, the output should be
“Invalid Keyword”.
If the specified page number is not present in the input, the output should be
“Invalid Page Number”.
1.2 Additional Considerations:
The operations 1 and 2 (Searching the first page/pages the keyword
appeared) are the most frequent operations and needs to be
implemented efficiently. You have to consider that the number of keywords
may be very high and all the keywords are specified at the start of the
program.
You are expected to implement all the data structures, and sorting and
searching algorithms that you use in the program.
Input to the program will be the name of the input file and all the outputs (of
the queries) should be written to the standard output. Any output other than
the results of queries (if there are any) should be written to a separate
output file (not to the standard output).
For the implementation purposes, you can assume that the maximum
number of keywords to be 500 and the maximum number of pages to be 200.
Answer
A. DESIGN OF SOLUTION
WordTree Data Structure
This is the data structure I used to store keywords given from the input text file.
Example:-
This is basically a tree data structure. At a node, it can have 0 or any number of branches up to 26. This division is
based on the order of the alphabet. In this structure each word has a unique position.
B. SELECTING SUITABLE DATA STRUCTURES/ALGORITHMS
Hash table & a Tree are main two data structures that I consider to use to store keywords. Both of them has
good properties also bad properties.
Main disadvantages of using hash table
Initially we should have some idea about the number of keywords, if the number of keywords is much larger than we expected, efficiency of searching will be tremendously low.
If collisions occurred, reduces efficiency when searching a keyword that collided.
Hard to find perfect hash function.
Main advantages of using WordTree data structure
Initially we do not need to know the number of keywords. Because this is dynamically create variables to add new keywords.
When the number of keyword very high, does not make any change of efficiency of searching a keyword.
When storing keywords, it stores in sorted order. No need to sort when showing list of keywords.
I have used a dynamic array to store page numbers for a keyword, this reduce space consumption.
Time taken to search a keyword only depends on the number of letters of the keyword.(T α n
letters )
Selecting sorting algorithm to sort page numbers for a keyword
Normally number of pages for a keyword is less than 10. So it is useless to use a complex algorithm for
sorting. Because as the number of values to be sorted is little, there is no much difference of using a good efficient complex sorting algorithm like Merge sort and using simpler algorithm like Bubblesort.
C. ASSUMPTIONS I MADE.
Keywords will not have letters like α, β, π (letters which are not in English alphabet) .
Even if the question says to assume maximum number of keywords to be 500 and maximum number
of pages to be 200, I didn’t assume so (I designed the solution for any number of keywords and
pages).
D. PROBLEMS I FACED
How to list keywords for a given page number?
To answer this either I have to use separate data structure or improve the WordTree(the
data structure that I implemented) . If I use a separate data structure, then I have to
implement methods to sort keywords in ascending order. This makes the program so
complex and increases the space consumption. Also when storing data WordTree stores
them in a sorted order. So I used same data structure for both purposes.
Problems with programming language.
With help of my friends and internet I could manage it.
E. HOW TO IMPROVE MY SOLUTION
When answering question 1 and 2, in my program gets the page numbers from tree and sort them
and show answers. If I could sort page numbers when storing, then it don’t has to sort when
answering questions. This will increase performance of both operations. But it will increase the
startup time of program.
By using more efficient sorting algorithms, performance will be increased.
Download Source code and PDF version of answer
Comments
Post a Comment