know-crawler's Issues

Continually test the integrity of the database to make sure no invalid articles are submitted

What steps will reproduce the problem?
1. Run the tests in db_manager_test and db_test
2. Manually inspect the database daily until the final release.

What is the expected output? What do you see instead?
Only articles with all the required information should be in the database.

Please use labels and text to provide additional information.


Original issue reported on code.google.com by [email protected] on 15 May 2012 at 2:18

Parse locations in articles

What steps will reproduce the problem?
1. Use the text body to find relevant locations in the article.
2. Check against a list of major cities/provinces/states/etc.

What is the expected output? What do you see instead?
A list of locations relevant to the article to be inserted into the database.
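The matching step could be sketched as a simple membership check of the article body against a known-location list. The location names below are illustrative stand-ins; the real list would come from the database.

```python
# Minimal sketch: match article text against a known-location list.
# These names are illustrative; the real list would come from the
# locations table in the database.
KNOWN_LOCATIONS = {"new york", "london", "cairo", "beijing"}

def find_locations(body_text):
    """Return the known locations mentioned in the article body."""
    text = body_text.lower()
    return sorted(loc for loc in KNOWN_LOCATIONS if loc in text)
```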

Please use labels and text to provide additional information.


Original issue reported on code.google.com by [email protected] on 25 May 2012 at 9:38

Implement the keywords function to determine keywords from text.

What steps will reproduce the problem?
1. We need to be able to find the keywords from the body text and tags.
2. We will reference words against a list of common English words.

What is the expected output? What do you see instead?
A list of 5-10 keywords per article.

Please use labels and text to provide additional information.
Add more links as we find information.
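One way to implement the steps above is a frequency count that skips common words and boosts words appearing in the tags. This is a sketch only; `keywords` is a hypothetical signature, and the common-word set here is a tiny stand-in for the real list.

```python
from collections import Counter
import re

# Illustrative stand-in for the common-words list (see the related issue
# about moving this list into the database).
COMMON_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for"}

def keywords(body_text, tags=(), limit=10):
    """Return up to `limit` keywords ranked by frequency, skipping common words."""
    words = re.findall(r"[a-z']+", body_text.lower())
    counts = Counter(w for w in words if w not in COMMON_WORDS)
    for tag in tags:                     # words from tags count as strong signals
        counts[tag.lower()] += 2
    return [w for w, _ in counts.most_common(limit)]
```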

Original issue reported on code.google.com by [email protected] on 1 May 2012 at 7:02

Insert list of common words into the database so it isn't hard-coded.

What steps will reproduce the problem?
1. Find a good list of the top ~2000 common words.
2. Insert those words into a table in the database.
3. Refactor code to call a new function that populates the list of common words 
in Utilities.py.

What is the expected output? What do you see instead?
A list of common words that is called through a function in Utilities.py.
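Steps 2 and 3 might look like the following sketch. It uses an in-memory sqlite3 database purely as a stand-in for the project's MySQL instance, and `get_common_words` is a hypothetical name for the new function in Utilities.py.

```python
import sqlite3

def get_common_words(conn):
    """Fetch the common-word list from the database instead of hard-coding it."""
    return {row[0] for row in conn.execute("SELECT word FROM common_words")}

# Stand-in database setup; the real table would live in the MySQL instance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE common_words (word TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO common_words VALUES (?)",
                 [("the",), ("and",), ("of",)])
```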

Please use labels and text to provide additional information.


Original issue reported on code.google.com by [email protected] on 15 May 2012 at 2:07

  • Merged into: #11

Need to implement a method to follow robots.txt

What steps will reproduce the problem?
1. We get blocked from nytimes.com.
2. Strip the url to its base and append robots.txt.
3. Parse robots.txt and do not follow any links in the disallowed sections.

What is the expected output? What do you see instead?
Only search urls that are allowed under the robots.txt guidelines.

Please use labels and text to provide additional information.
http://www.nytimes.com/robots.txt provides additional information.
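The standard library already covers most of this. A sketch using `urllib.robotparser` (the module is named `robotparser` on the Python 2 the project likely ran; `allowed` and the agent name are illustrative):

```python
from urllib.robotparser import RobotFileParser
from urllib.parse import urlsplit

def allowed(url, robots_lines, agent="know-crawler"):
    """Check a url against robots.txt rules. The rules are supplied as lines
    here to avoid a network fetch; set_url() + read() would fetch them live."""
    base = "{0.scheme}://{0.netloc}".format(urlsplit(url))
    rp = RobotFileParser(base + "/robots.txt")
    rp.parse(robots_lines)
    return rp.can_fetch(agent, url)
```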

Original issue reported on code.google.com by [email protected] on 1 May 2012 at 3:31

Parse accurate article submission dates

What steps will reproduce the problem?
1. Parse a news source once a week (or less often)
2. Parse an article and add its submission date as the date parsed
3. Old articles will have inaccurate submission dates

What is the expected output? What do you see instead?
Article table should contain date article was written, but instead it could 
contain a date weeks later than the correct date.  Database submission date 
should only be used as a fallback in the case that the news source does not 
have accurate meta-information about dates.

Please use labels and text to provide additional information.
Date article time parse.

Original issue reported on code.google.com by [email protected] on 16 May 2012 at 3:48

Begin crawling other news sources

What steps will reproduce the problem?
1. Currently our web crawler can only handle articles on nytimes.com.
2. Begin crawling aljazeera.com, bbc.co.uk, and thesun.co.uk/.

What is the expected output? What do you see instead?
We hope to be able to retrieve article information from numerous news sources.

Please use labels and text to provide additional information.
http://www.aljazeera.com/
http://www.thesun.co.uk/
http://www.bbc.co.uk/

Original issue reported on code.google.com by [email protected] on 4 May 2012 at 12:00

Improve algorithm for determining what is an article

What steps will reproduce the problem?
1. Research more into what defines an article from a non-article in various 
websites.
2. Review how WebsiteCrawler.py distinguishes articles.

What is the expected output? What do you see instead?
We currently do only simple checks on the file name of each link on a web page 
to check if it is an article and then ensure it has a title, description, and 
url like all articles have. We would like to gather more information on how to 
improve this process to increase the number of news sources and articles we can 
parse.

Original issue reported on code.google.com by [email protected] on 30 May 2012 at 9:47

MySql connection error

What steps will reproduce the problem?
1. Set the default mysql socket to /rc12/d04/knowcse2/mysql.sock:
   ini_set('mysql.default_socket', '/rc12/d04/knowcse2/mysql.sock');
2. Establish the mysql connection:
   mysql_connect("localhost:32001", "root", "purple pony disco");

What is the expected output? What do you see instead?
The expected behavior:
The index.php page successfully connects to the database and pulls out urls 
from user_list table in the database.
The observed behavior:
SQL connection error: Can't connect to local MySQL server through socket 
'/rc12/d04/knowcse2/mysql.sock'.

Original issue reported on code.google.com by [email protected] on 16 May 2012 at 5:13

Byte strings are encoded in utf-8 and need to be converted back to readable text.

What steps will reproduce the problem?
1. Tags from the article data do not format all characters properly.
2. Look up how to decode the byte code.

What is the expected output? What do you see instead?
In the case of \xe2\x80\x99 we want an '. We need to convert all of these byte 
strings into readable sentences/words.

Please use labels and text to provide additional information.
http://stackoverflow.com/questions/873419/converting-to-safe-unicode-in-python
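A sketch of the decode step: treat the raw article data as utf-8 bytes, then optionally map typographic punctuation back to plain ASCII. The `PUNCT_MAP` entries are illustrative, not a complete set.

```python
# \xe2\x80\x99 is the utf-8 encoding of U+2019 (right single quotation mark).
PUNCT_MAP = {"\u2019": "'", "\u2018": "'", "\u201c": '"', "\u201d": '"'}

def to_readable(raw_bytes):
    """Decode utf-8 bytes and replace typographic punctuation with ASCII."""
    text = raw_bytes.decode("utf-8")
    for fancy, plain in PUNCT_MAP.items():
        text = text.replace(fancy, plain)
    return text
```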

Original issue reported on code.google.com by [email protected] on 1 May 2012 at 6:52

Figure out why some websites cannot be searched

What steps will reproduce the problem?
1. Look into issues with mechanize.
2. Attempt to appear as a browser and reattempt a connection.
3. Look into other libraries such as Beautiful Soup to see if they have 
similar issues.

What is the expected output? What do you see instead?
We cannot retrieve any data from the homepage of aljazeera.com. We can in the 
case of nytimes.com.

Please use labels and text to provide additional information.
http://www.aljazeera.com/
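Step 2 ("appear as a browser") usually means sending browser-like headers, since some sites refuse the default Python user agent. A sketch with the standard library (with mechanize, the equivalent is setting `Browser.addheaders` before opening the page); the header values are illustrative:

```python
from urllib.request import Request

def browser_like_request(url):
    """Build a request with browser-like headers instead of the Python default."""
    return Request(url, headers={
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
        "Accept": "text/html,application/xhtml+xml",
    })
```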

Original issue reported on code.google.com by [email protected] on 3 May 2012 at 11:56

Allow ProcessDispatcher to accept a list of arguments to parse

What steps will reproduce the problem?
1. Currently, we are only able to run one news source at a time.
2. Allow for either a list of sources or a file with a list of sources.

What is the expected output? What do you see instead?
Runs the given list of news sources.

Please use labels and text to provide additional information.
Research optparse usage.
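With optparse this could be sketched as follows: positional arguments are sources, and an optional file supplies more, one url per line. The option names and `parse_sources` are illustrative.

```python
from optparse import OptionParser

def parse_sources(argv):
    """Return news sources from the command line and, optionally, a file."""
    parser = OptionParser(usage="%prog [options] [source ...]")
    parser.add_option("-f", "--file", dest="source_file",
                      help="file containing one news source url per line")
    options, sources = parser.parse_args(argv)
    if options.source_file:
        with open(options.source_file) as fh:
            sources += [line.strip() for line in fh if line.strip()]
    return sources
```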

Original issue reported on code.google.com by [email protected] on 12 May 2012 at 9:46

Add list of common words to the database and implement the list in Utilities

What steps will reproduce the problem?
1. We currently have the list hardcoded.
2. Save the list into the database and call a function to retrieve data.

What is the expected output? What do you see instead?
Have the data stored in the database.

Please use labels and text to provide additional information.


Original issue reported on code.google.com by [email protected] on 13 May 2012 at 4:50

Crawler only runs on ovid01 locally.

Due to problems that arose in configurations and permissions on ovid01 when we 
migrated the database from the old ovid21 computer to the new ovid01 computer 
(as requested by the Know group), the webcrawler can only connect to the 
database on ovid01, and as such must be run from ovid01.

The Know group was notified of this issue, and we are waiting on a response 
before we can decide how to proceed on fixing the issue. For now, just run the 
crawler on ovid01.


Original issue reported on code.google.com by [email protected] on 30 May 2012 at 10:33

Database table for locations could use improvement.

What steps will reproduce the problem?
1. The table has improperly formatted locations.
2. The table is lacking some important locations such as key cities.

What is the expected output? What do you see instead?
We would like to see the table have all relevant locations to the KNOW project 
and not contain empty rows.

Please use labels and text to provide additional information.
We work around the issue of poor formatting of the database table by stripping 
locations of white space and using ascii decoding.

Original issue reported on code.google.com by [email protected] on 30 May 2012 at 9:40

Process relative urls

What steps will reproduce the problem?
1. If the url does not start with http:// then it is a relative url link.
2. Add functionality to allow processing of websites that use relative urls in 
their links.

What is the expected output? What do you see instead?
Be able to process relative urls on various news sources.

Please use labels and text to provide additional information.
http://www.aljazeera.com/ uses relative urls in their links.
http://www.nytimes.com/ uses absolute urls though. The current implementation 
only processes absolute urls.
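The standard library handles this directly: `urljoin` resolves a relative link against the page it was found on and leaves absolute urls untouched, so no http:// check is needed.

```python
from urllib.parse import urljoin

def resolve(page_url, link):
    """Resolve a link (relative or absolute) against the page it came from."""
    return urljoin(page_url, link)
```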

Original issue reported on code.google.com by [email protected] on 1 May 2012 at 3:34

The time tag is picking up the usageTerms tag

What steps will reproduce the problem?
1. Run WebsiteCrawler.
2. Test to see how get_tag_by_name in ArticleParser is run.

What is the expected output? What do you see instead?
We expect to see time information, not the usage terms.

Please use labels and text to provide additional information.
nytimes.com queries link to 
http://www.nytimes.com/content/help/rights/sale/terms-of-sale.html.

Original issue reported on code.google.com by [email protected] on 1 May 2012 at 6:42

Output messages to log file instead of console

What steps will reproduce the problem?
1. Run the web crawler from command line
2. Messages sent to console output are hard to view

What is the expected output? What do you see instead?
Writes messages to log.txt, allowing for better diagnoses of crashes and 
analysis of performance.

Please use labels and text to provide additional information.
Log output console analysis.

Original issue reported on code.google.com by [email protected] on 16 May 2012 at 2:44

Connection to webpage is insecure

Trying to connect to the webpage gives a "Connection Insecure" dialog. This can 
be ignored as the website does not have any inputs that can damage the internal 
database.

Hit "Load Anyway" to continue operation.


Original issue reported on code.google.com by [email protected] on 31 May 2012 at 12:22

Use parallel processing in ProcessDispatcher

What steps will reproduce the problem?
1. We currently use only one process to run the webcrawler.
2. We need to divide the work among the cores available to us.

What is the expected output? What do you see instead?
Run ProcessDispatcher.py in parallel.
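Dividing the sources among worker processes could be sketched with `multiprocessing.Pool`. Here `crawl_source` is a stand-in for the per-source work ProcessDispatcher currently does serially.

```python
from multiprocessing import Pool

def crawl_source(source):
    """Stand-in for crawling one news source; returns a (source, status) pair."""
    return (source, "done")

def crawl_all(sources, workers=2):
    """Run crawl_source over the sources in parallel across worker processes."""
    with Pool(processes=workers) as pool:
        return pool.map(crawl_source, sources)
```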

Please use labels and text to provide additional information.


Original issue reported on code.google.com by [email protected] on 15 May 2012 at 2:15

Further strip leftover ampersand entities from article entries

When decoding information from articles, some special characters are decoded 
into ampersand entities such as &990;. While this will not affect queries that 
use the "like" command, it would be useful to strip all text of unnecessary 
characters.
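A regex sketch of the stripping step, assuming the leftovers look like numeric character entities (the "&990;" above appears to be one with the "#" lost, so the pattern tolerates both forms):

```python
import re

# Match both well-formed (&#990;) and degraded (&990;) entity leftovers.
ENTITY_RE = re.compile(r"&#?\w+;")

def strip_entities(text):
    """Remove leftover character entities from article text."""
    return ENTITY_RE.sub("", text)
```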

Original issue reported on code.google.com by [email protected] on 31 May 2012 at 2:43

Normalize date tags

What steps will reproduce the problem?
1. Review date tags from multiple websites.
2. Review standards from nytimes.com.
3. Produce regular expressions to identify the format used on each website.

What is the expected output? What do you see instead?
A normalized date string in the form yyyy-mm-dd.

Please use labels and text to provide additional information.
2012-04-26 is the standard.
Be able to process formats such as April 26, 2012.
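Rather than hand-written regular expressions, `datetime.strptime` can try each known format in turn; the format list below is a starting sketch that would grow as sources are added.

```python
from datetime import datetime

# Known input formats, tried in order; extend as new sources are added.
DATE_FORMATS = ["%Y-%m-%d", "%B %d, %Y", "%d %B %Y"]

def normalize_date(raw):
    """Return the date as yyyy-mm-dd, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None
```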

Original issue reported on code.google.com by [email protected] on 1 May 2012 at 1:05

Allow for processing of urls with case insensitivity

What steps will reproduce the problem?
1. Urls appear with varying letter case.

What is the expected output? What do you see instead?
Urls should be processed regardless of letter case.

Please use labels and text to provide additional information.
http://www.nytimes.com/ or HTTP://WWW.NYTIMES.COM/ or http://www.NYTIMES.com/ 
etc.
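Only the scheme and host of a url are case-insensitive (paths can be case-sensitive on some servers), so a sketch of the normalization lowercases just those parts:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_case(url):
    """Lowercase the scheme and host of a url, leaving the path untouched."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, parts.fragment))
```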

Original issue reported on code.google.com by [email protected] on 1 May 2012 at 12:54

Need to check when to utilize keywords function

What steps will reproduce the problem?
1. Run the keywords function in ArticleParser on any article.

What is the expected output? What do you see instead?
Sometimes the keywords tag for an article is poorly written; in that case we 
should replace those keywords with our own generated list.

Please use labels and text to provide additional information.
Check ArticleParser.py

Original issue reported on code.google.com by [email protected] on 15 May 2012 at 2:05
