EBook Keyword index program in java-Data Structures and Algorithms


CS2022 – Data Structures and Algorithms 
B. Sc. Engineering  Semester 2
Programming Project

1  Introduction 
1.1  The Idea
E-books are  becoming  more  and  more  popular  due  to  various  reasons.  In  this
project, you are required to create and maintain a keyword index for an e-book. The
keywords  are  extracted  from  the  book  and  are  made  available  to  you  in  the
following format.


Contents of the File “Input.txt” 
 Keyword List Start 
Bubble  3 
Merge  1 
Heapsort  1 
Heap  1 
Sort  2 
Greedy  2 
Complexity   2 
game   3 
sort  1 
Keyword List End 
Queries Start 
First heap 
List sort 
List Bubble 
Keywords 1 
Keywords 2 
Queries End


1.2  Output of the Program 

1 2 

Heap Heapsort Merge sort 
Complexity Greedy Sort  


In  the  above list,  each  line  contains  a  string  and  a  number  separated by a “tab”
character. The string is the keyword and the number represents a page number the
keyword  appeared.  You  are  expected  to  read  the  input  file and  store  the  keywords
and the pages they appear. The system should provide the following functionality.
1.  Given a keyword, output the first page the keyword appeared.
2.  Given a keyword, output the sorted list of pages (in ascending order) a given
keyword appeared.
3.  Given a page number, output the sorted list of keywords (in ascending order)
that appeared in the page.
The operations will be specified in the following format in the input file.
1.  First<whitespace><Keyword>
     a.  First Keyword7
2.  List<whitespace><Keyword>
     a.  List Keyword1
3.  Keywords<whitespace><page_number>
     a.  Keywords 2


The  output  of  the  operations  should  be  displayed  in  the  standard  output  in  the
following manner.
  The result of each query should be listed in a single line.
  For  the  second  operation,  the  pages  should  be  separated  by  a  single
whitespace.
  For  the  third operation,  the  keywords  should  be  separated  by  a  single
whitespace.
  If  the  specified  keyword  is  not  present  in  the  input,  the  output  should  be
“Invalid Keyword”.
  If the specified page number is not present in the input, the output should be
“Invalid Page Number”.

1.2  Additional Considerations:
  The operations  1  and  2  (Searching  the  first  page/pages  the  keyword
appeared)  are  the  most  frequent  operations  and  needs  to  be
implemented efficiently. You have to consider that the number of keywords
may  be  very  high  and  all  the  keywords  are  specified  at  the  start  of  the
program.
  You  are  expected  to  implement  all  the  data  structures,  and  sorting  and
searching algorithms that you use in the program.
  Input to the program will be the name of the input file and all the outputs (of
the queries) should be written to the standard output. Any output other than
the  results  of  queries  (if  there  are  any)  should  be  written  to  a  separate
output file (not to the standard output).
  For  the  implementation  purposes,  you  can  assume  that  the  maximum
number of keywords to be 500 and the maximum number of pages to be 200.




Answer


A.  DESIGN OF SOLUTION

 

WordTree Data Structure 
This is the data structure I used to store keywords given from the input text file.
Example:-


This is basically a tree data structure. At a node, it can have 0 or any number of branches up to 26. This division is
based on the order of the alphabet. In this structure each word has a unique position.



B.  SELECTING SUITABLE DATA STRUCTURES/ALGORITHMS 
  Hash table & a Tree are main two data structures that I consider to use to store keywords. Both of them has
good properties also bad properties.

Main disadvantages of using hash table 

  Initially we should have some idea about the number of keywords, if the number of keywords is much                larger than we expected, efficiency of searching will be tremendously low.
   If collisions occurred, reduces efficiency when searching a keyword that collided.
  Hard to find perfect hash function.

Main advantages of using WordTree data structure

  Initially we do not need to know the number of keywords. Because this is dynamically create variables to add new keywords.
  When the number of keyword very high, does not make any change of efficiency of searching a keyword.
  When storing keywords, it stores in sorted order. No need to sort when showing list of keywords.
  I have used a dynamic array to store page numbers for a keyword, this reduce space consumption.
  Time taken to search a keyword only depends on the number of letters of the keyword.(T α n
letters )

Selecting sorting algorithm to sort page numbers for a keyword 

  Normally number of pages for a keyword is less than 10. So it is useless to use a complex algorithm for
sorting. Because as the number of values to be sorted is little, there is no much difference of using a good efficient complex sorting algorithm like Merge sort and using simpler algorithm like Bubblesort.

C.  ASSUMPTIONS I MADE. 
  Keywords will not have letters like α, β, π (letters which are not in English alphabet) .
  Even if the question says to assume maximum number of keywords to be 500 and maximum number
of pages to be 200, I didn’t assume so (I designed the solution for any number of keywords and
pages).

D.  PROBLEMS I FACED
  How to list keywords for a given page number?
To answer this either I have to use separate data structure or improve the WordTree(the
data structure that I implemented) . If I use a separate data structure, then I have to
implement methods to sort keywords in ascending order. This makes the program so
complex and increases the space consumption. Also when storing data WordTree stores
them in a sorted order. So I used same data structure for both purposes.
  Problems with programming language.
With help of my friends and internet I could manage it.

E.  HOW TO IMPROVE MY SOLUTION
  When answering question 1 and 2, in my program gets the page numbers from tree and sort them
and show answers. If I could sort page numbers when storing, then it don’t has to sort when
answering questions. This will increase performance of both operations. But it will increase the
startup time of program.
  By using more efficient sorting algorithms, performance will be increased.


Download Source code and PDF version of answer
































Comments