HTML parser with NLTK

From dandi08

Overview

Cody purchased the web domain http://www.webstripper.org for this project, as he plans to keep developing it long after this class is over.

The program will have you log into the server (all participants on this project will be given a username and password); you then feed the program a search string. The program crawls the pages of the most relevant results from popular search engines, then finds the text within each document and extracts it (getting rid of all the ads, links, and clutter). It then rewrites that text to an HTML document and adds a link to where the information came from at the bottom of the page for each result that it crawled. Finally, it uploads the HTML documents to our own server, where they can be categorized and searched at a later time.

The result should be something like this: I log into the server and type in the string "how to care for rabbits". It will extract the text from the 5-10 most relevant results from a few search engines. Almost instantly, we will have 20-30 texts on how to care for rabbits uploaded to our online database.
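The extract-and-rewrite step could be sketched roughly as follows. This is a minimal illustration using only Python's standard-library HTMLParser, not the real program; which tags count as "clutter" (here just script and style) and the helper names strip_page and TextExtractor are assumptions for the sketch.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text from a page, skipping script/style content."""

    SKIP = {"script", "style"}  # tags assumed to be clutter for this sketch

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside skipped tags and non-empty.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

    def text(self):
        return " ".join(self.parts)

def strip_page(html, source_url):
    """Extract the text of a page and rewrite it as a bare HTML document
    with a link back to the source, as the project plan describes."""
    parser = TextExtractor()
    parser.feed(html)
    return (
        "<html><body><p>{}</p>"
        '<p>Source: <a href="{}">{}</a></p></body></html>'
    ).format(parser.text(), source_url, source_url)
```

For example, feeding strip_page a crawled results page and its URL would yield a clutter-free document ending in a source link, ready for upload to the server.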

Updates

If you haven't already, I would recommend going over Section 2.2.2 (Lemmatization and Normalization) in the NLTK book. It is likely that we will be using this technique to strip our target webpages. Sections 12.3.1 (Spiders) and 12.3.2 (Creating Language Resources Using Word Processors), especially Listing 12.1 on p. 307, might also be relevant.
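As a reminder of what normalization buys us, here is a toy sketch of the idea from Section 2.2.2: lowercase each token and strip a common suffix so that "Rabbits", "rabbit", and "rabbits" all index the same. The hand-written suffix list is purely illustrative; the real program would use NLTK's stemmers or WordNetLemmatizer instead.

```python
import re

# Very rough suffix rules, for illustration only; NLTK's PorterStemmer
# or WordNetLemmatizer would replace these in the real program.
SUFFIXES = ["ing", "ies", "es", "s", "ed"]

def normalize(token):
    """Lowercase a token and strip at most one common English suffix."""
    token = token.lower()
    for suffix in SUFFIXES:
        # Require a reasonably long stem so short words survive intact.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def normalize_text(text):
    """Tokenize on word characters and normalize every token."""
    return [normalize(t) for t in re.findall(r"\w+", text)]
```

Running normalize_text over a stripped page would give the normalized word list we could categorize and search on the server.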

Meeting Times

We will be meeting in the computer lab after class on Thursdays. Try to have a couple hours free, but we won't always need to work for that long.


Timesheets

Nov. 04, 08
Nov. 06, 08

Documentation

HTML Parser Proposal - last updated 11/06/08

Team Members

User: boncod27
Cody Bonney
boncod27@evergreen.edu
Cody's Facebook

User: soklin30
Linda Sok
soklin30@evergreen.edu
Linda's Facebook


User: bowjam16
Jamie Bown
bowjam16@evergreen.edu