nolancash / know-crawler
Automatically exported from code.google.com/p/know-crawler
What steps will reproduce the problem?
1. Run the tests in db_manager_test and db_test
2. Manually inspect the database daily until the final release.
What is the expected output? What do you see instead?
Only articles with all the required information should be in the database.
Please use labels and text to provide additional information.
Original issue reported on code.google.com by [email protected]
on 15 May 2012 at 2:18
What steps will reproduce the problem?
1. Use the text body to find relevant locations in the article.
2. Check against a list of major cities/provinces/states/etc.
What is the expected output? What do you see instead?
A list of locations relevant to the article to be inserted into the database.
Please use labels and text to provide additional information.
Original issue reported on code.google.com by [email protected]
on 25 May 2012 at 9:38
What steps will reproduce the problem?
1. We need to be able to find the keywords from the body text and tags.
2. We will reference words against a list of common English words.
What is the expected output? What do you see instead?
A list of 5-10 keywords per article.
Please use labels and text to provide additional information.
Add more links as we find information.
Original issue reported on code.google.com by [email protected]
on 1 May 2012 at 7:02
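One way the steps above could be sketched, assuming a simple frequency count after filtering against the common-words list (the tiny COMMON_WORDS set here is a stand-in for the project's full list):

```python
import re
from collections import Counter

# Stand-in for the full common-English-words list the project maintains.
COMMON_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "on"}

def extract_keywords(body_text, limit=10):
    # Split the body into lowercase words, drop common words and very
    # short tokens, and keep the most frequent remainders as keywords.
    words = re.findall(r"[a-z']+", body_text.lower())
    counts = Counter(w for w in words
                     if w not in COMMON_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(limit)]
```

With `limit` capped at 10, this yields the 5-10 keywords per article the issue asks for; tags could be merged into `body_text` before counting.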
What steps will reproduce the problem?
1. Find a good list of the top ~2000 common words.
2. Insert those words into a table in the database.
3. Refactor code to call a new function that populates the list of common words
in Utilities.py.
What is the expected output? What do you see instead?
A list of common words that is called through a function in Utilities.py.
Please use labels and text to provide additional information.
Original issue reported on code.google.com by [email protected]
on 15 May 2012 at 2:07
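A minimal sketch of the proposed Utilities.py helper, assuming a table named common_words with a single word column (sqlite3 stands in for the project's MySQL connection so the example is self-contained):

```python
import sqlite3

def get_common_words(conn):
    # Pull every word from the common_words table into a set for
    # fast membership checks during keyword filtering.
    cur = conn.execute("SELECT word FROM common_words")
    return {row[0] for row in cur.fetchall()}

# Illustrative setup: an in-memory table seeded with a few words.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE common_words (word TEXT)")
conn.executemany("INSERT INTO common_words VALUES (?)",
                 [("the",), ("and",), ("of",)])
common = get_common_words(conn)
```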
What steps will reproduce the problem?
1. We get blocked from nytimes.com.
2. Strip url to its base and then add the extension robots.txt
3. Parse robots.txt and do not follow any links in its disallowed sections.
What is the expected output? What do you see instead?
Only search urls that are allowed under the robots.txt guidelines.
Please use labels and text to provide additional information.
http://www.nytimes.com/robots.txt provides additional information.
Original issue reported on code.google.com by [email protected]
on 1 May 2012 at 3:31
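The steps above can be sketched with the standard library's robotparser; robots_url strips the url to its base and appends robots.txt, and is_allowed (which fetches over the network) asks whether a given link may be crawled. The user-agent string is illustrative:

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

def robots_url(url):
    # Strip the url to scheme + host and append /robots.txt.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

def is_allowed(url, user_agent="know-crawler"):
    rp = RobotFileParser(robots_url(url))
    rp.read()  # fetches and parses the robots.txt file
    return rp.can_fetch(user_agent, url)
```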
What steps will reproduce the problem?
1. Parse a news source once a week (or less often)
2. Parse an article and add its submission date as the date parsed
3. Old articles will have inaccurate submission dates
What is the expected output? What do you see instead?
Article table should contain date article was written, but instead it could
contain a date weeks later than the correct date. Database submission date
should only be used as a fallback in the case that the news source does not
have accurate meta-information about dates.
Please use labels and text to provide additional information.
Labels: Date, Article, Time, Parse.
Original issue reported on code.google.com by [email protected]
on 16 May 2012 at 3:48
What steps will reproduce the problem?
1. Currently our web crawler can only handle articles on nytimes.com.
2. Begin crawling aljazeera.com, bbc.co.uk, and thesun.co.uk.
What is the expected output? What do you see instead?
We hope to be able to retrieve article information from numerous news sources.
Please use labels and text to provide additional information.
http://www.aljazeera.com/
http://www.thesun.co.uk/
http://www.bbc.co.uk/
Original issue reported on code.google.com by [email protected]
on 4 May 2012 at 12:00
What steps will reproduce the problem?
1. Research more into what defines an article from a non-article in various
websites.
2. Review how WebsiteCrawler.py distinguishes articles.
What is the expected output? What do you see instead?
We currently do only simple checks on the file name of each link on a web page
to check if it is an article and then ensure it has a title, description, and
url like all articles have. We would like to gather more information on how to
improve this process to increase the number of news sources and articles we can
parse.
Original issue reported on code.google.com by [email protected]
on 30 May 2012 at 9:47
What steps will reproduce the problem?
1. Set the default socket of mysql to /rc12/d04/knowcse2/mysql.sock:
ini_set('mysql.default_socket', '/rc12/d04/knowcse2/mysql.sock');
2. Establish mysql connection:
mysql_connect("localhost:32001",
"root",
"purple pony disco");
What is the expected output? What do you see instead?
The expected behavior:
The index.php page successfully connects to the database and pulls out urls
from user_list table in the database.
The observed behavior:
SQL connection error: Can't connect to local MySQL server through socket
'/rc12/d04/knowcse2/mysql.sock'.
Original issue reported on code.google.com by [email protected]
on 16 May 2012 at 5:13
What steps will reproduce the problem?
1. Tags from the article data do not format all characters properly.
2. Look up how to decode the byte code.
What is the expected output? What do you see instead?
In the case of \xe2\x80\x99 we want an apostrophe ('). We need to convert all
of these byte strings into readable sentences/words.
Please use labels and text to provide additional information.
http://stackoverflow.com/questions/873419/converting-to-safe-unicode-in-python
Original issue reported on code.google.com by [email protected]
on 1 May 2012 at 6:52
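These byte strings are UTF-8, so decoding them yields the intended characters (b"\xe2\x80\x99" is the right single quotation mark). A minimal sketch of the fix, with decode_tag as a hypothetical helper name:

```python
def decode_tag(raw_bytes):
    # "replace" keeps the crawler running if a malformed byte slips in,
    # substituting U+FFFD rather than raising UnicodeDecodeError.
    return raw_bytes.decode("utf-8", errors="replace")
```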
What steps will reproduce the problem?
1. Look into issues with mechanize.
2. Attempt to appear as a browser and reattempt a connection.
3. Look into other crawling libraries such as beautiful soup to see if they
have similar issues.
What is the expected output? What do you see instead?
We cannot retrieve any data from the homepage of aljazeera.com. We can in the
case of nytimes.com.
Please use labels and text to provide additional information.
http://www.aljazeera.com/
Original issue reported on code.google.com by [email protected]
on 3 May 2012 at 11:56
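Step 2 could be sketched as follows, using urllib instead of mechanize so the example is self-contained; the browser-like User-Agent string is illustrative, not a value the project settled on:

```python
import urllib.request

def fetch_as_browser(url):
    # Some sites refuse the default Python user-agent; sending a
    # browser-like header often gets past that check.
    req = urllib.request.Request(
        url,
        headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

mechanize exposes the same idea through its addheaders attribute on the Browser object.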
What steps will reproduce the problem?
1. Currently, we are only able to run one news source at a time.
2. Allow for either a list of sources or a file with a list of sources.
What is the expected output? What do you see instead?
Runs the given list of news sources.
Please use labels and text to provide additional information.
Research optparse usage.
Original issue reported on code.google.com by [email protected]
on 12 May 2012 at 9:46
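A sketch of the proposed command line using optparse, as the issue suggests: accept either a comma-separated --sources list or a --source-file with one url per line. The option names are illustrative:

```python
from optparse import OptionParser

def parse_sources(argv):
    parser = OptionParser()
    parser.add_option("--sources",
                      help="comma-separated news source urls")
    parser.add_option("--source-file",
                      help="file with one source url per line")
    options, _ = parser.parse_args(argv)
    if options.sources:
        return options.sources.split(",")
    if options.source_file:
        with open(options.source_file) as f:
            return [line.strip() for line in f if line.strip()]
    return []
```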
What steps will reproduce the problem?
1. We currently have the list hardcoded.
2. Save the list into the database and call a function to retrieve data.
What is the expected output? What do you see instead?
Have the data stored in the database.
Please use labels and text to provide additional information.
Original issue reported on code.google.com by [email protected]
on 13 May 2012 at 4:50
Due to problems that arose in configurations and permissions on ovid01 when we
migrated the database from the old ovid21 computer to the new ovid01 computer
(as requested by the Know group), the webcrawler can only connect to the
database on ovid01, and as such must be run from ovid01.
The Know group was notified of this issue, and we are waiting on a response
before we can decide how to proceed on fixing the issue. For now, just run the
crawler on ovid01.
Original issue reported on code.google.com by [email protected]
on 30 May 2012 at 10:33
What steps will reproduce the problem?
1. The table has improperly formatted locations.
2. The table is lacking some important locations such as key cities.
What is the expected output? What do you see instead?
We would like to see the table have all relevant locations to the KNOW project
and not contain empty rows.
Please use labels and text to provide additional information.
We work around the issue of poor formatting of the database table by stripping
locations of white space and using ascii decoding.
Original issue reported on code.google.com by [email protected]
on 30 May 2012 at 9:40
What steps will reproduce the problem?
1. If the url does not start with http:// then it is a relative url link.
2. Add functionality to allow processing of websites that use relative urls in
their links.
What is the expected output? What do you see instead?
Be able to process relative urls on various news sources.
Please use labels and text to provide additional information.
http://www.aljazeera.com/ uses relative urls in their links.
http://www.nytimes.com/ uses absolute urls though. The current implementation
only processes absolute urls.
Original issue reported on code.google.com by [email protected]
on 1 May 2012 at 3:34
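The standard library already handles this case: urljoin resolves a relative link against the page it appeared on and leaves absolute links untouched, which covers both the aljazeera.com and nytimes.com styles described above.

```python
from urllib.parse import urljoin

def resolve_link(page_url, link):
    # Relative links are resolved against page_url; absolute links
    # pass through unchanged.
    return urljoin(page_url, link)
```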
What steps will reproduce the problem?
1. Run WebsiteCrawler.
2. Test to see how get_tag_by_name in ArticleParser is run.
What is the expected output? What do you see instead?
We expect to see time information, not the usage terms.
Please use labels and text to provide additional information.
nytimes.com queries link to
http://www.nytimes.com/content/help/rights/sale/terms-of-sale.html.
Original issue reported on code.google.com by [email protected]
on 1 May 2012 at 6:42
What steps will reproduce the problem?
1. Run the web crawler from command line
2. Messages sent to console output are hard to view
What is the expected output? What do you see instead?
Writes messages to log.txt, allowing for better diagnoses of crashes and
analysis of performance.
Please use labels and text to provide additional information.
Labels: Log, Output, Console, Analysis.
Original issue reported on code.google.com by [email protected]
on 16 May 2012 at 2:44
Trying to connect to the webpage gives a "Connection Insecure" dialog. This can
be ignored as the website does not have any inputs that can damage the internal
database.
Hit "Load Anyway" to continue operation.
Original issue reported on code.google.com by [email protected]
on 31 May 2012 at 12:22
What steps will reproduce the problem?
1. We currently use only one process to run the webcrawler.
2. We need to divide the work among the cores available to us.
What is the expected output? What do you see instead?
Run ProcessDispatcher.py in parallel.
Please use labels and text to provide additional information.
Original issue reported on code.google.com by [email protected]
on 15 May 2012 at 2:15
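A sketch of dividing the work among the available cores with multiprocessing; crawl_source here is a hypothetical stand-in for the per-source work ProcessDispatcher.py currently does in a single process:

```python
from multiprocessing import Pool

def crawl_source(url):
    # Placeholder for the real per-source crawl.
    return (url, "done")

def dispatch(sources, workers=4):
    # Map the source list across a pool of worker processes.
    with Pool(processes=workers) as pool:
        return pool.map(crawl_source, sources)
```

On platforms that spawn rather than fork workers, dispatch must be called from under an `if __name__ == "__main__":` guard.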
When decoding information from articles, some special characters are decoded
into HTML entities such as &#990;. While this will not affect queries that use
the "like" command, it would be useful to strip all text of these unnecessary
characters.
Original issue reported on code.google.com by [email protected]
on 31 May 2012 at 2:43
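One possible approach, assuming the leftovers are HTML character entities: html.unescape turns numeric entities back into real characters, and a regex then drops any entity that could not be resolved. clean_entities is a hypothetical helper name:

```python
import html
import re

def clean_entities(text):
    # Resolve known entities (e.g. &#65; -> "A"), then strip any
    # unresolved &name; or &#number; leftovers.
    text = html.unescape(text)
    return re.sub(r"&#?\w+;", "", text)
```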
What steps will reproduce the problem?
1. Review date tags from multiple websites.
2. Review standards from nytimes.com.
3. Produce regular expressions to identify the format used on each website.
What is the expected output? What do you see instead?
A normalized date string in the form yyyy-mm-dd.
Please use labels and text to provide additional information.
2012-04-26 is the standard.
Be able to process formats such as April 26, 2012.
Original issue reported on code.google.com by [email protected]
on 1 May 2012 at 1:05
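The normalization described above can be sketched with strptime instead of hand-written regular expressions; the format list is illustrative and would grow as more news sources are added:

```python
from datetime import datetime

# Formats observed so far; "April 26, 2012" matches the first entry.
FORMATS = ["%B %d, %Y", "%Y-%m-%d", "%d %B %Y"]

def normalize_date(raw):
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # caller falls back to the database submission date
```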
What steps will reproduce the problem?
1. Encounter urls whose scheme or host uses varying letter case.
What is the expected output? What do you see instead?
Process urls regardless of letter case.
Please use labels and text to provide additional information.
http://www.nytimes.com/ or HTTP://WWW.NYTIMES.COM/ or http://www.NYTIMES.com/
etc.
Original issue reported on code.google.com by [email protected]
on 1 May 2012 at 12:54
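A sketch of the normalization: the scheme and host of a url are case-insensitive while the path may not be, so only those two parts are lowercased before comparison:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    # Lowercase scheme and host only; paths can be case-sensitive.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, parts.fragment))
```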
A small change; done.
Original issue reported on code.google.com by [email protected]
on 30 May 2012 at 11:47
What steps will reproduce the problem?
1. Run the keywords function in ArticleParser on any article.
What is the expected output? What do you see instead?
Sometimes the keywords tag for an article is poorly written, in that case we
should replace those keywords with our own list.
Please use labels and text to provide additional information.
Check ArticleParser.py
Original issue reported on code.google.com by [email protected]
on 15 May 2012 at 2:05
This is a small change.
Original issue reported on code.google.com by [email protected]
on 30 May 2012 at 11:49