Name    - Intelli-Aggie
Author  - Srijith.K (srijith/@/srijith.net)
Version - 1.0 Pre-Alpha
License - GPL


About
-*-*-*
This is the pre-alpha version of Intelli-Aggie.
Intelli-Aggie is a proof-of-concept RSS feed
aggregator and sorter that does the following:

*)Fetches user defined RSS feeds
*)Categorises news items in the feeds into user
defined categories based on user defined keywords.
*)Generates lists of these news items grouped in
various views.

*)The most novel thing about Intelli-Aggie is that 
the system tries to adapt according to the reader's
reading preference, trying to show him/her more
interesting and relevant news items first.


Design Principles
*-*-*-*-*-*-*-*-
As you might have seen in the fetchrss.ini file, there are a lot of
parameters that can be tweaked for personal preference. All these
parameters play a role in the order of news items in the three HTML
output files. The three files group the items by relevance, by
keyword and by blog/source respectively.


The basic design ideas are:

Category Adaptation - E.g. a user who reads more technology news should be shown more
technology news.

Source Adaptation - A user who reads more news posted by a blogger 'A'
will be more interested in news items by the same person.

Cleanup - An item that has been read, or that has gone unread for a long time, should
be automatically removed.

To get this result, we use "relevances" and "weights", lots of them!


Relevances
---------
They are used to convey a relative preference. If you like feed A better than
feed B, you give feed A a higher relevance than feed B etc.

*)"cat_relv" - Used to give relative relevance to categories. If you are a techie,
you may want to give "Technology" category a relevance of 75 and "Politics" a relevance
of 50.

*)"feed_relv" - To give relative preference to a source of information. If you think you
like Wired stories more than Rediff stories, give Wired a higher relevance.

*)"keyword_relv" - How relevant the presence of this keyword is to the category.
If I feel that any story that has the word "Wireless" in it should go to its associated
category "Wi-Fi", I would give it a relevance of 100. If, however, I feel that a keyword "MP3"
may be associated with category "Technology" only to a small extent (the news item could
actually belong to category "Entertainment"), I would give it a relevance of only 50.

*)"item_relv" - This is the only automatically calculated relevance. More on this later,
after we discuss "weights".


Weights
-----------
Weights are specified to help in making decisions. There are two major decisions to be made
by Intelli-Aggie (all weights are <=1)

(I)Item categorization - When a feed item is retrieved, under which category should the
item be filed? The weights used in this process are:

	*)"keyword_in_title_weightage" -  If I find a particular keyword in the title of the item,
	I am pretty sure it should be filed in the category of the keyword. So, I give the value '1'
	to this weightage.

	*)"keyword_in_subject_weightage" - The same goes for the subject section of the item/news.

	*)"keyword_in_description_weightage" - If the keyword appears in the description, then it might
	not be so close a match. So I give a weight of only 0.75

	*)"feed_category_inheretance_weightage" - Each feed is assigned a category when it is added
	to the system. So, any item from this feed should be in some way related to that category.
	What weight to give for the feed category to influence the item categorization process?
	I give 0.50 because I have inherent mistrust in most blogger sources. I let the keywords
	do the work for me, rather than the theme of the source.

-------------------The Item Categorization Process -----------------------------------------
THUS, for each item, a sum is calculated as:
for each category {
	sum = 0
	if (source feed's category = this category) {
		sum = feed_category_inheretance_weightage*100
	}
	for each keyword assigned to this category that occurs in the item {
		sum = sum + (keyword_in_title_weightage OR keyword_in_subject_weightage OR keyword_in_description_weightage)*keyword_relv
		/* Depending on which part of the item matched the keyword */
	}
}

The category with the biggest sum is assigned to the item.
-------------------------------------------------------------------------------------------
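
The process above could be sketched in Python roughly as follows. This is only an
illustration, not the actual Intelli-Aggie code. The function name "categorize", the
dictionary layout and the rule "the first matching part wins, checked in the order title,
subject, description" are my assumptions. The weight values are the example values from
the text above.

```python
# Example weight values taken from the text above.
PART_WEIGHTS = {
    "title": 1.0,         # keyword_in_title_weightage
    "subject": 1.0,       # keyword_in_subject_weightage
    "description": 0.75,  # keyword_in_description_weightage
}
FEED_CATEGORY_WEIGHT = 0.50  # feed_category_inheretance_weightage


def categorize(item, feed_category, keywords_by_category):
    """Return (best_category, sums) for one item.

    item: dict with 'title', 'subject' and 'description' strings.
    keywords_by_category: {category: {keyword: keyword_relv}} (hypothetical layout).
    """
    sums = {}
    for category, keywords in keywords_by_category.items():
        total = 0.0
        if feed_category == category:
            total += FEED_CATEGORY_WEIGHT * 100
        for keyword, keyword_relv in keywords.items():
            # Assumption: only the first part that contains the keyword counts.
            for part in ("title", "subject", "description"):
                if keyword.lower() in item.get(part, "").lower():
                    total += PART_WEIGHTS[part] * keyword_relv
                    break
        sums[category] = total
    # The category with the biggest sum is assigned to the item.
    best = max(sums, key=sums.get)
    return best, sums


example_item = {
    "title": "New Wireless router ships",
    "subject": "",
    "description": "It plays MP3 files too",
}
example_keywords = {"Wi-Fi": {"Wireless": 100}, "Technology": {"MP3": 50}}
best, sums = categorize(example_item, "Technology", example_keywords)
print(best, sums)
```

In this example the feed is filed under "Technology", so that category starts at
0.50*100 = 50 and gets 0.75*50 = 37.5 for "MP3" in the description, 87.5 in total.
"Wi-Fi" gets 1*100 = 100 for "Wireless" in the title, so the item goes to "Wi-Fi".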


(II)Item relevance calculation - Each item has to be given a relevance (i.e. relative importance).
We mentioned this earlier as "item_relv". For this decision process we define the following
weights:
	
	*)"feed_w" - How important is the source in deciding the importance of the item. If you are
	like me and mistrust sources, assign a small value like '0.4' to it. If you
	really believe that you will like all the items from a particular feed, say 'A', just assign
	a large value (e.g. 95) to the feed_relv of that feed, rather than make this feed_w larger.

	*)"cat_w" - How important is the category of the item in deciding the importance of the item.
	My taste in feeds is category based. I like Tech stuff a lot more than political stuff. So, for me
	the category of a news item plays a very important role in deciding the importance of the item.
	So, I give this a larger value like 0.7

	*)"cat_inherit_from_feed_w" - However much I like the item's category to influence the item's
	importance, there are cases when the item may not match any keyword at all and its category
	is decided solely based on the source's category. Given my (now very clear) mistrust in sources,
	I would give this a smaller value than "cat_w", like 0.5
	
	So, if the category of the item is decided by keywords, "cat_w" is used. If no match occurs with
	the keywords and the category of the item is based solely on the feed category, "cat_inherit_from_feed_w"
	is used as the weight.

	*)"keyword_match_w" - This is a slightly different weight. This is the only weight that is given
	integer values and not used as a scale. The idea is this. Even though an item might eventually
	be categorised as belonging to "Software", it might have contained occurrences of keywords belonging
	to the category "Wireless". In the end, I am more likely to be interested in an item that has more
	keyword matches, irrespective of the category of the keyword. So, every time an item contains
	an occurrence of a keyword, I add this value "keyword_match_w" to its importance. Keep it a small
	value, like 1.


-------------------------------------------------------------------------------------------------------------------
THUS, the relevance of an item is calculated as:

if (there has been at least one keyword match for the item) {
	overall_relv=(feed_w * feed_relv) + (cat_w * cat_relv) + keyword_match_w*no_of_matches_of_any_keyword;
}
else {
	overall_relv=(feed_w * feed_relv) + (cat_inherit_from_feed_w * cat_relv);
}
--------------------------------------------------------------------------------------------------------------------
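
Here is the same calculation as a small Python sketch, just to make the arithmetic
concrete. The function name "item_relevance" and the sample feed_relv/cat_relv numbers
are made up for illustration. The default weights are the example values from the text
above.

```python
def item_relevance(feed_relv, cat_relv, no_of_matches,
                   feed_w=0.4, cat_w=0.7,
                   cat_inherit_from_feed_w=0.5, keyword_match_w=1):
    """Relevance of one item, following the two branches described above."""
    if no_of_matches > 0:
        # Category was decided by keywords: use cat_w and reward every match.
        return (feed_w * feed_relv) + (cat_w * cat_relv) \
            + keyword_match_w * no_of_matches
    # No keyword matched: the category was inherited from the feed.
    return (feed_w * feed_relv) + (cat_inherit_from_feed_w * cat_relv)


# Hypothetical item from a feed with feed_relv 60 in a category with cat_relv 75.
with_matches = item_relevance(60, 75, no_of_matches=3)
without_matches = item_relevance(60, 75, no_of_matches=0)
print(with_matches, without_matches)
```

With three keyword matches this works out to 0.4*60 + 0.7*75 + 1*3 = 24 + 52.5 + 3 = 79.5.
With no match at all it is 0.4*60 + 0.5*75 = 24 + 37.5 = 61.5, so the same item ranks
clearly lower when its category comes only from the feed.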


Other than the categorization and relevance calculation of the item, we have two more important processes.

(I)Cleanup
Any read item has to be deleted from the list fast. If an item has not been read for some time, it most
probably means that the item is not too interesting. So, first decrease its relevance. If it is still
not read after some more time, just delete it. However, there might be some items that stand out,
with a very high relevance. Consider them holy and don't do anything to them. All this is done using
more user definable parameters:

	*)"read_hr_thresh" - Delete a read item this much time after its fetch. Don't put a very small value
	like 1 as you might want to re-read them later.


	*)"dec_relv_hr_thresh" - Time to wait to decrease the relevance of an unread item.
	
	*)"notread_hr_thresh" - Time to wait before completely deleting an unread item. By logic, keep this
	bigger than "dec_relv_hr_thresh"

	*)"no_delete_relv_thresh" - If an item has been given relevance more than this, it is most likely that
	it is very special. Consider it holy and don't try to delete it or decrease its relevance.

	*)"relv_dec_factor" - After "dec_relv_hr_thresh", change the item relevance to this much of the original value.
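
A minimal sketch of the cleanup decision in Python, not the actual Intelli-Aggie code.
The names "cleanup_action" and "decreased_relevance", the dictionary layout and all the
threshold values are made up for illustration. I also assume that ages are counted in
hours since the item was fetched, and that the "holy" protection applies to unread items
only (read items are still removed after "read_hr_thresh").

```python
# Hypothetical parameter values, in hours where the name contains "hr".
PARAMS = {
    "read_hr_thresh": 24,
    "dec_relv_hr_thresh": 48,
    "notread_hr_thresh": 96,
    "no_delete_relv_thresh": 90,
    "relv_dec_factor": 0.5,
}


def cleanup_action(age_hours, is_read, relevance, params):
    """Return 'delete', 'decrease' or 'keep' for one item."""
    if is_read:
        # Read items go away once they are old enough.
        return "delete" if age_hours >= params["read_hr_thresh"] else "keep"
    if relevance > params["no_delete_relv_thresh"]:
        # "Holy" item: never deleted, never decreased (assumed for unread items).
        return "keep"
    if age_hours >= params["notread_hr_thresh"]:
        return "delete"
    if age_hours >= params["dec_relv_hr_thresh"]:
        return "decrease"
    return "keep"


def decreased_relevance(original_relevance, params):
    """New relevance is relv_dec_factor times the original value."""
    return original_relevance * params["relv_dec_factor"]


print(cleanup_action(60, False, 50, PARAMS), decreased_relevance(80, PARAMS))
```

Checking "notread_hr_thresh" before "dec_relv_hr_thresh" is what makes the advice above
work: as long as the first is the bigger value, an unread item is first decreased and
only later deleted.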


(II)Adaptation
Based on the reading habit of the user, the system tries to update the relevances associated with the
feed and the category, using these parameters:

	*)"feed_relv_increment_cgi" - This is the amount added to feed_relv every time you read an item
	from this source. Keep this a small value, otherwise you might lose the relative importance between feeds.

	*)"cat_relv_increment_cgi" - This is the amount added to cat_relv when you read an item which falls
	in a particular category.

At the same time, the system has to adjust using knowledge of what the user is NOT reading too. These parameters
help:

	*)"feed_relv_adjust_hr_thresh" - If no item from a feed has been read for this much time, it is time to
	decrease its relevance.

	*)"cat_relv_adjust_hr_thresh" - If no item from a category has been read for this much time, it is time to
	decrease the category's relevance.

	*)"feed_relv_decrement" - Once it is time to decrease the feed relevance, how much should it be decreased
	by. It is an absolute number, not a ratio, i.e. new feed_relv = old feed_relv - feed_relv_decrement

	*)"cat_relv_decrement" - Similar to "feed_relv_decrement" but for the category.
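
The two directions of adaptation could look roughly like this in Python. Again this is
only a sketch under my own assumptions: the function names "on_item_read" and
"decay_if_stale" and all the numbers are hypothetical, and I assume the thresholds are in
hours and that nothing stops a relevance from going below zero.

```python
# Hypothetical parameter values.
PARAMS = {
    "feed_relv_increment_cgi": 1,
    "cat_relv_increment_cgi": 1,
    "feed_relv_adjust_hr_thresh": 168,
    "cat_relv_adjust_hr_thresh": 168,
    "feed_relv_decrement": 5,
    "cat_relv_decrement": 5,
}


def on_item_read(feed_relv, cat_relv, params):
    """Reading an item bumps the relevance of its feed and its category."""
    return (feed_relv + params["feed_relv_increment_cgi"],
            cat_relv + params["cat_relv_increment_cgi"])


def decay_if_stale(relv, hours_since_last_read, hr_thresh, decrement):
    """Subtract an absolute amount if nothing was read for hr_thresh hours."""
    if hours_since_last_read >= hr_thresh:
        return relv - decrement
    return relv


feed_relv, cat_relv = on_item_read(60, 75, PARAMS)
feed_relv = decay_if_stale(feed_relv, 200,
                           PARAMS["feed_relv_adjust_hr_thresh"],
                           PARAMS["feed_relv_decrement"])
print(feed_relv, cat_relv)
```

The same "decay_if_stale" helper serves both feeds and categories. Only the threshold and
the decrement passed in differ.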


Manual Override
-----------------------
Sometimes, you might see an item placed very high in relevance when you do not think it is worth
that much importance. When this happens, you can click on the (-) next to the item, and a manual
decrease in the feed and category relevance is done, to make sure that similar items do not rise
so high in the relevance chart. The parameters involved are:

	*)"feed_relv_decrement_cgi" - By how much should the feed relevance be decreased.

	*)"cat_relv_decrement_cgi" - By how much should the category relevance be decreased.


---------------------------------------- ** END ** ---------------------------------------------------