Name    - Intelli-Aggie
Author  - Srijith.K (srijith/@/srijith.net)
Version - 1.0 Pre-Alpha
License - GPL

About
-*-*-*
This is the pre-alpha version of Intelli-Aggie. Intelli-Aggie is a proof-of-concept RSS feed aggregator and sorter that does the following:

*)Fetches user defined RSS feeds.
*)Categorises news items in the feeds into user defined categories based on user defined keywords.
*)Generates lists of these news items grouped in various views.
*)The most novel thing about Intelli-Aggie is that the system tries to adapt to the reader's reading preferences, showing him/her the more interesting and relevant news items first.

Design Principles
*-*-*-*-*-*-*-*-
As you might have seen in the fetchrss.ini file, there are a lot of parameters that can be tweaked for personal preference. All of these parameters play a role in the order of news items in the three HTML output files. The three files present relevance based, keyword based and blog/source based views.

The basic design ideas are:

Category Adaptation - E.g. a user who reads more technology news should be shown more technology news.
Source Adaptation - A user who reads more news posted by a blogger 'A' will be more interested in news items by the same person.
Cleanup - An item that has been read, or that has gone unread for a long time, should be removed automatically.

To get this result, we use "relevance" and "weights", lots of them!

Relevances
---------
They are used to convey a relative preference. If you like feed A better than feed B, you give feed A a higher relevance than feed B, etc.

*)"cat_relv" - Used to give relative relevance to categories. If you are a techie, you may want to give the "Technology" category a relevance of 75 and "Politics" a relevance of 50.
*)"feed_relv" - Used to give relative preference to sources of information. If you think you like Wired stories more than Rediff stories, give Wired a higher relevance.
*)"keyword_relv" - How relevant the presence of this keyword is to its category.
If I feel that any story containing the word "Wireless" should go to its associated category "Wi-Fi", I would give the keyword a relevance of 100. If, however, I feel that the keyword "MP3" is associated with the category "Technology" only to a small extent (the news item could actually belong to the category "Entertainment"), I would give it a relevance of only 50.
*)"item_relv" - This is the only automatically calculated relevance. More on this later, after we discuss "weights".

Weights
-----------
Weights are specified to help in making decisions. There are two major decisions to be made in Intelli-Aggie. (All weights are <=1, except "keyword_match_w", which is discussed later.)

(I)Item categorization - When a feed item is retrieved, under which category should it be put? The weights used in this process are:

*)"keyword_in_title_weightage" - If I find a particular keyword in the title of the item, I am pretty sure it should be filed in the category of the keyword. So, I give the value '1' to this weightage.
*)"keyword_in_subject_weightage" - The same goes for the subject section of the item/news.
*)"keyword_in_description_weightage" - If the keyword appears only in the description, it might not be so close a match. So I give a weight of only 0.75.
*)"feed_category_inheretance_weightage" - Each feed is assigned a category when it is added to the system. So, any item from this feed should be related in some way to that category. How much weight should the feed category have in the item categorization process? I give 0.50 because I have an inherent mistrust of most blogger sources. I let the keywords do the work for me, rather than the theme of the source.
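
To make the interplay of these weights concrete, here is a minimal Python sketch of a scoring routine built on them. It is not the actual Intelli-Aggie code: the function name `categorize`, the dictionary layout and the sample item are made up for illustration, and it assumes that a keyword found in several parts of an item counts once, at the highest applicable weight.

```python
# Weight values taken from the text above; the data layout is an assumption.
FIELD_WEIGHTS = {
    "title": 1.0,         # keyword_in_title_weightage
    "subject": 1.0,       # keyword_in_subject_weightage
    "description": 0.75,  # keyword_in_description_weightage
}
FEED_CATEGORY_INHERETANCE_WEIGHTAGE = 0.50


def categorize(item, feed_category, keywords_by_category):
    """Return (best_category, scores) for one feed item.

    item: dict with optional "title", "subject", "description" strings.
    keywords_by_category: {category: {keyword: keyword_relv}}.
    """
    scores = {}
    for category, keywords in keywords_by_category.items():
        # An item starts with a bonus in the category of its source feed.
        total = 0.0
        if category == feed_category:
            total = FEED_CATEGORY_INHERETANCE_WEIGHTAGE * 100
        for keyword, keyword_relv in keywords.items():
            # Collect the weights of every field that contains the keyword.
            matched = [weight for field, weight in FIELD_WEIGHTS.items()
                       if keyword.lower() in item.get(field, "").lower()]
            if matched:
                # Assumption: count the keyword once, at its best weight.
                total += max(matched) * keyword_relv
        scores[category] = total
    # Ties go to the category listed first.
    best = max(scores, key=scores.get)
    return best, scores


item = {
    "title": "Wireless sync comes to music players",
    "description": "The new player handles MP3 files.",
}
keywords = {"Wi-Fi": {"Wireless": 100}, "Technology": {"MP3": 50}}
best, scores = categorize(item, "Technology", keywords)
print(best, scores)
```

In this example the item comes from a "Technology" feed, yet the title match on "Wireless" (1 * 100) outscores the feed bonus plus the description match on "MP3" (50 + 0.75 * 50), so the item would be filed under "Wi-Fi".
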

-------------------The Item Categorization Process -----------------------------------------
THUS, for each item, a sum is calculated as:

for each category {
    sum = 0
    if (source feed's category = this category) {
        sum = feed_category_inheretance_weightage*100
    }
    for each keyword assigned to this category that matches the item {
        /* Depending on which part of the item matched the keyword */
        sum = sum + (keyword_in_title_weightage OR keyword_in_subject_weightage OR keyword_in_description_weightage)*keyword_relv
    }
}

The category with the biggest sum is assigned to the item.
-------------------------------------------------------------------------------------------

(II)Item relevance calculation - Each item has to be given a relevance (i.e. relative importance). We mentioned this earlier as "item_relv". For this decision process we define the following weights:

*)"feed_w" - How important is the source in deciding the importance of the item? If you are like me and mistrust sources, assign a small value like '0.4' to it. If you really believe that you will like all the items from a particular feed, say 'A', just assign a large value (e.g. 95) to its feed_relv, rather than making feed_w larger.
*)"cat_w" - How important is the category of the item in deciding the importance of the item? My taste in feeds is category based. I like tech stuff a lot more than political stuff. So, for me the category of a news item plays a very important role in deciding the importance of the item, and I give this a larger value like 0.7.
*)"cat_inherit_from_feed_w" - However much I like the item's category to influence the item's importance, there are cases when the item may not match any keyword at all and its category is decided solely by the source's category. Given my (now very clear) mistrust of sources, I would give this a smaller value than "cat_w", like 0.5. So, if the category of the item is decided by keywords, "cat_w" is used.
If no match occurs with the keywords and the category of the item is based solely on the feed category, "cat_inherit_from_feed_w" is used as the weight.
*)"keyword_match_w" - This is a slightly different weight. It is the only weight that is given integer values and is not used as a scaling factor. The idea is this: even though an item might eventually be categorised as belonging to "Software", it might have contained occurrences of keywords belonging to the category "Wireless". In the end, I am more likely to be interested in an item that has more keyword matches, irrespective of the category of the keyword. So, every time an item contains an occurrence of a keyword, I add the value "keyword_match_w" to its importance. Keep it a small value, like 1.

-------------------------------------------------------------------------------------------------------------------
THUS, the relevance of an item is calculated as:

if (there has been at least one keyword match for the item) {
    overall_relv = (feed_w * feed_relv) + (cat_w * cat_relv) + (keyword_match_w * no_of_matches_of_any_keyword);
} else {
    overall_relv = (feed_w * feed_relv) + (cat_inherit_from_feed_w * cat_relv);
}
--------------------------------------------------------------------------------------------------------------------

Other than the categorization and relevance calculation of the item, we have two more important processes.

(I)Cleanup
Any read item has to be deleted from the list fast. If an item has not been read for some time, it most probably means that the item is not too interesting. So, first decrease its relevance. If it is still not read after some more time, just delete it. However, there might be some items that stand out, with a very high relevance. Consider them holy and don't do anything to them. All of this is done using more user definable parameters:

*)"read_hr_thresh" - Delete a read item this much time after its fetch. Don't use a very small value like 1, as you might want to re-read items later.
*)"dec_relv_hr_thresh" - Time to wait before decreasing the relevance of an unread item.
*)"notread_hr_thresh" - Time to wait before completely deleting an unread item. Logically, keep this bigger than "dec_relv_hr_thresh".
*)"no_delete_relv_thresh" - If an item has been given a relevance higher than this, it is most likely very special. Consider it holy and don't try to delete it or decrease its relevance.
*)"relv_dec_factor" - After "dec_relv_hr_thresh", change the item relevance to this fraction of its original value.

(II)Adaptation
Based on the reading habits of the user, the system tries to update the relevances associated with the feed and the category, using these parameters:

*)"feed_relv_increment_cgi" - This is the amount added to feed_relv every time you read an item from this source. Keep this a small value, otherwise you might lose the relative importance between feeds.
*)"cat_relv_increment_cgi" - This is the amount added to cat_relv when you read an item which falls in a particular category.

At the same time, the system also has to adjust using knowledge of what the user is NOT reading. These parameters help:

*)"feed_relv_adjust_hr_thresh" - If no item from a feed has been read for this much time, it is time to decrease the feed's relevance.
*)"cat_relv_adjust_hr_thresh" - If no item from a category has been read for this much time, it is time to decrease the category's relevance.
*)"feed_relv_decrement" - Once it is time to decrease the feed relevance, how much should it be decreased by? It is an absolute number, not a ratio, i.e. new feed relv = old feed relv - feed_relv_decrement.
*)"cat_relv_decrement" - Similar to "feed_relv_decrement" but for the category.

Manual Override
-----------------------
Sometimes, you might see an item placed very high in relevance when you do not think it deserves that much importance.
When this happens, you can click on the (-) next to the item, and a manual decrease of the feed and category relevance is done, to make sure that similar items do not rise so high in the relevance chart. The parameters involved are:

*)"feed_relv_decrement_cgi" - By how much should the feed relevance be decreased.
*)"cat_relv_decrement_cgi" - By how much should the category relevance be decreased.

---------------------------------------- ** END ** ---------------------------------------------------
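
Postscript: the item relevance formula and the cleanup rules described earlier can be sketched in Python as follows. This is an illustration only, not the Intelli-Aggie source. The function names and the threshold values in `PARAMS` are invented (this document gives no defaults for them), and the sketch assumes that all cleanup times are measured in hours since the item was fetched and that "holy" items are kept whether or not they have been read.

```python
def item_relevance(feed_relv, cat_relv, keyword_matches,
                   feed_w=0.4, cat_w=0.7,
                   cat_inherit_from_feed_w=0.5, keyword_match_w=1):
    """Compute item_relv from the feed and category relevances."""
    if keyword_matches > 0:
        # Category was decided by keywords: use cat_w and reward each match.
        return (feed_w * feed_relv + cat_w * cat_relv
                + keyword_match_w * keyword_matches)
    # Category was inherited from the feed: use the weaker weight.
    return feed_w * feed_relv + cat_inherit_from_feed_w * cat_relv


# Hypothetical values, in hours; choose your own in fetchrss.ini.
PARAMS = {
    "read_hr_thresh": 24,
    "dec_relv_hr_thresh": 48,
    "notread_hr_thresh": 96,
    "no_delete_relv_thresh": 95,
    "relv_dec_factor": 0.5,
}


def cleanup(relv, hours_since_fetch, is_read, params=PARAMS):
    """Return (action, new_relv); action is "keep", "decrease" or "delete".

    The caller is expected to apply a "decrease" only once per item.
    """
    if relv > params["no_delete_relv_thresh"]:
        return "keep", relv  # holy item: never touched
    if is_read:
        if hours_since_fetch >= params["read_hr_thresh"]:
            return "delete", relv
        return "keep", relv
    if hours_since_fetch >= params["notread_hr_thresh"]:
        return "delete", relv
    if hours_since_fetch >= params["dec_relv_hr_thresh"]:
        return "decrease", relv * params["relv_dec_factor"]
    return "keep", relv


# feed_relv 50, cat_relv 75, three keyword matches.
print(item_relevance(50, 75, 3))
# An unread item of relevance 60, fetched 50 hours ago.
print(cleanup(60, 50, False))
```

Checking "no_delete_relv_thresh" first means a very relevant item survives every other rule, which matches the "consider it holy" description.
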