This was a project for In4matx 141 - Information Retrieval at UCI.
In this project, starter files are given; the main task is to extract URLs to crawl next and to validate that they are not traps. The requirements can be found here.
In this part, we are given a base website and a set of links. We crawl the base website for outgoing links, and if a link is in the given set, we crawl it as well. Along the way, we save analytics such as the most common words, excluding stop words. Perhaps the most interesting part is filtering out traps: we removed non-HTML links, links outside the .ics.uci.edu domain, URLs longer than 100 characters, and links that appeared too often in our recent history. These heuristics were chosen to stop infinite loops and to remain within the given set of links.
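The trap filters described above can be sketched in Python roughly as follows. This is a minimal illustration, not the project's actual code; the rejected file extensions, the window size, and the repeat threshold are assumptions.

```python
from urllib.parse import urlparse
from collections import deque

RECENT_LIMIT = 50  # sliding window of recently seen URLs (assumed size)
MAX_REPEATS = 3    # how often a URL may recur before we call it a trap
recent = deque(maxlen=RECENT_LIMIT)

def is_valid(url: str) -> bool:
    """Return True if the URL passes the trap filters described above."""
    parsed = urlparse(url)
    # 1. Keep only HTML-ish resources (reject common binary/asset extensions).
    if parsed.path.lower().endswith((".pdf", ".jpg", ".png", ".gif", ".css", ".js")):
        return False
    # 2. Stay inside the .ics.uci.edu domain.
    if not parsed.netloc.endswith(".ics.uci.edu"):
        return False
    # 3. Reject overly long addresses, a common trap signature.
    if len(url) > 100:
        return False
    # 4. Reject URLs seen too often in the recent history window.
    if recent.count(url) >= MAX_REPEATS:
        return False
    recent.append(url)
    return True
```

The recency check is what catches calendar-style traps, where a page keeps generating slightly different links back to itself.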
This project was then extended to extract the most relevant links from a given corpus (the pages found above). The purpose is to build a search engine of sorts.
The next set of requirements is here.
In this part, we are given a corpus along with a JSON file that maps files to links; this models the results from part 1. Here we built an inverted index, storing the postings in JSON files keyed by the first character of each word to speed up lookups.
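A minimal sketch of that sharding scheme, assuming Python (the tokenizer, data layout, and file names are illustrative, not the project's actual code):

```python
import json
import os
import re
from collections import defaultdict

def build_index(docs: dict) -> dict:
    """Build an inverted index sharded by each word's first character.

    docs maps a document id to its text. The result maps a shard key
    (first character) to {word: {doc_id: term_frequency}}.
    """
    shards = defaultdict(lambda: defaultdict(dict))
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            shard = shards[word[0]]
            shard[word][doc_id] = shard[word].get(doc_id, 0) + 1
    return shards

def save_shards(shards: dict, out_dir: str = "index") -> None:
    # One JSON file per starting character, e.g. index/a.json, index/b.json,
    # so a query for "apple" only has to open the "a" shard.
    os.makedirs(out_dir, exist_ok=True)
    for key, shard in shards.items():
        with open(os.path.join(out_dir, f"{key}.json"), "w") as f:
            json.dump(shard, f)
```

Sharding by first character keeps each file small, so a lookup loads only a fraction of the full index.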
To find relevance, we lemmatized words and converted numbers to word form. We then retrieved the files containing the searched-for term and ranked them by cosine similarity.
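The ranking step can be sketched as follows, with documents and queries represented as sparse term-weight vectors. The exact term weighting the project used (raw counts vs. tf-idf) is not specified, so this sketch works with whatever weights are supplied:

```python
import math

def cosine_similarity(query_vec: dict, doc_vec: dict) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0
    return dot / (q_norm * d_norm)

def rank(query_vec: dict, doc_vecs: dict) -> list:
    """Return document ids sorted by descending similarity to the query."""
    scores = {doc_id: cosine_similarity(query_vec, vec)
              for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Because cosine similarity normalizes by vector length, long documents are not automatically favored over short ones that match the query just as well.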
All this code can be found here.