Web Crawler

This was a project for In4matx 141 (Information Retrieval) at UCI.

In this project, some starter files are given; the main task is to extract URL links for future crawling and to validate them so that they are not traps. The requirements can be found here.
In this part, we are given a base website and a set of links. We crawl the base website for external links, and if a link is in the given set, we crawl through it as well. Along the way, we record analytics such as the most common words, excluding stop words. Perhaps the most interesting part is filtering out traps: we removed non-HTML links, links outside .ics.uci.edu, addresses longer than 100 characters, and links that appeared too often in our recent history. These filters were chosen to stop infinite loops and to stay within the given set of links.
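The trap filter described above can be sketched as a single validity check. This is a minimal illustration, not the project's actual code: the window size, repeat threshold, and extension list are assumptions chosen for the example.

```python
from collections import deque
from urllib.parse import urlparse

RECENT_WINDOW = 500   # hypothetical size of the recent-history window
RECENT_LIMIT = 10     # hypothetical repeat threshold before a path is a trap
NON_HTML_EXTS = (".pdf", ".jpg", ".png", ".gif", ".css", ".js", ".ico", ".zip")

recent_paths = deque(maxlen=RECENT_WINDOW)

def is_valid(url: str) -> bool:
    """Reject likely traps before adding a URL to the crawl frontier."""
    parsed = urlparse(url)
    # Stay within the given domain.
    if not parsed.netloc.endswith(".ics.uci.edu"):
        return False
    # Skip overly long addresses, a common trap symptom.
    if len(url) > 100:
        return False
    # Skip obviously non-HTML resources.
    if parsed.path.lower().endswith(NON_HTML_EXTS):
        return False
    # Skip paths seen too often in recent history (infinite-loop guard).
    if recent_paths.count(parsed.path) >= RECENT_LIMIT:
        return False
    recent_paths.append(parsed.path)
    return True
```

A bounded deque keeps the history check cheap: only the most recent paths are counted, so a long crawl does not slow the filter down.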

This project was then extended with the ability to retrieve the most relevant links from a given corpus (the pages found above), the goal being to build a simple search engine. The next set of requirements are here. In this part, we are given a corpus along with a JSON file mapping files to links, modeling the results from part 1. We built an inverted index, sharding the data into JSON files named after the first character of each word to speed up lookups. To improve matching, we lemmatized words and converted numbers to word form. Then we retrieved the files containing the searched-for term and ranked them by cosine similarity.
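The indexing and ranking steps above can be sketched as follows. This is an in-memory illustration under stated assumptions: it uses whitespace tokenization with tf-idf weights, and it omits the project's lemmatization, number-to-word conversion, and on-disk JSON sharding by first character.

```python
import math
from collections import Counter, defaultdict

def build_index(docs: dict) -> dict:
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index

def cosine_rank(query: str, docs: dict, index: dict) -> list:
    """Score documents against the query by tf-idf cosine similarity."""
    n = len(docs)
    q_terms = Counter(query.lower().split())
    scores = defaultdict(float)
    q_norm = 0.0
    for term, q_tf in q_terms.items():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n / len(postings))
        q_w = q_tf * idf
        q_norm += q_w * q_w
        # Accumulate dot products with every document containing the term.
        for doc_id, tf in postings.items():
            scores[doc_id] += q_w * tf * idf
    q_len = math.sqrt(q_norm)
    if q_len == 0:
        return []
    # Precompute each document's vector length for normalization.
    doc_norms = defaultdict(float)
    for term, postings in index.items():
        idf = math.log(n / len(postings))
        for doc_id, tf in postings.items():
            doc_norms[doc_id] += (tf * idf) ** 2
    return sorted(
        ((doc_id, s / (math.sqrt(doc_norms[doc_id]) * q_len))
         for doc_id, s in scores.items() if doc_norms[doc_id] > 0),
        key=lambda pair: -pair[1])
```

Sharding the index into per-character JSON files, as the project does, would simply group `index` by `term[0]` before writing each group to its own file, so a query only loads the shards its terms start with.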

All this code can be found here.