A Statistical Method for Profiling Network Traffic

NOTE : This is a fairly old paper (1999, David Marchette).

General gist : Describes mechanisms for profiling the activity of machines on a network.

Angle : This paper is about detecting anomalies across a group of machines owned by an organization. That is, given say 1000 machines owned by organization A, attempt to find unusual traffic and then investigate it. The unusual traffic is found by grouping machines into activity groups.

Value to me : Clustering is interesting and possibly valuable when gathering data.

Verdict : I should read this again!

Data sets : March - April 1998 (Naval Surface Warfare Center), filtered to machines that had more than 10 concurrent conns with two distinct ports (a month of traffic was used to define "normal").

Problems :
-> Filtering traffic into interesting/uninteresting so as to focus on the interesting traffic.
-> What is abnormal?
-> Traffic denied by the firewall?
   * The problem with this is we can only really consider "tricks" we have seen before.
-> Some conns may only be "bad" when made to certain machines. That is, a machine that doesn't normally do ftp suddenly does (maybe this only matters if it is an infrastructure machine).
-> Paper focuses on 2 clustering techniques. Clustering is done on:
   -> Date/Time
   -> TCP/UDP
   -> Dst port/addy
   -> Src port/addy
   -> Others could be incorporated, such as packet size, flags, etc.

Measuring Network Traffic :
-> Obvious way is to observe port frequency: high frequency -> problem.
-> Paper shows an example with telnet frequency over a month.
-> Argues that "obviously" this sort of analysis can only be done for a small number of ports (I agree, if the analysis is done by "hand"...).
-> Proposes looking at the probability of a conn being from port x, i.e. (freq of port x / total packets), and comparing this across months (see the first sketch after this section).
-> The paper suggests treating low probability as unusual (I contest this: any change, high-to-low or low-to-high, that can be deemed significant should be investigated).
-> The idea of sessions: sequential accesses from one source IP (see the second sketch after this section).
   -> Could count the sessions (see the paper Computer Immunology [3] by Forrest).
   -> Model a normal session and note deviations (apparently the work is ongoing here).
-> Goal is not to catch the bad guys but to filter for interesting traffic (meh?).
-> The researchers' experience is that even with the low-probability filter there are too many problems to deal with...
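To make the per-port probability idea concrete, here is a minimal sketch (mine, not the paper's; the toy data, function names, and the 0.05 change threshold are all assumptions). It estimates P(port) = freq(port) / total for two months and, per my objection above, flags significant moves in either direction rather than only drops to low probability:

```python
from collections import Counter

def port_probabilities(connections):
    """Estimate P(port) as freq(port) / total connections observed."""
    counts = Counter(connections)
    total = sum(counts.values())
    return {port: n / total for port, n in counts.items()}

def flag_changes(baseline, current, threshold=0.05):
    """Flag ports whose probability moved by more than `threshold`
    in either direction between the two months (the paper only looks
    for low probabilities; both directions is my variation)."""
    ports = set(baseline) | set(current)
    return {p: (baseline.get(p, 0.0), current.get(p, 0.0))
            for p in ports
            if abs(current.get(p, 0.0) - baseline.get(p, 0.0)) > threshold}

# Toy data: one destination port per observed connection.
march = [23] * 80 + [80] * 15 + [21] * 5               # telnet-heavy baseline month
april = [23] * 40 + [80] * 50 + [21] * 5 + [6000] * 5  # telnet drops, http jumps

print(flag_changes(port_probabilities(march), port_probabilities(april)))
```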
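And a quick sketch of the session idea, since these notes don't record an actual algorithm from the paper. The obvious approach, which is my guess: sort each source IP's accesses by time and start a new session whenever the silence exceeds some timeout. The 300-second gap and the record format are assumptions:

```python
from itertools import groupby
from operator import itemgetter

def sessionize(records, gap=300):
    """Group (src_ip, timestamp) records into sessions: consecutive
    accesses from the same source IP with no silence longer than
    `gap` seconds between them. Timeout value is my assumption."""
    sessions = []
    records = sorted(records, key=itemgetter(0, 1))   # by src_ip, then time
    for src, group in groupby(records, key=itemgetter(0)):
        current, last_t = [], None
        for _, t in group:
            if last_t is not None and t - last_t > gap:
                sessions.append((src, current))       # gap too long: close session
                current = []
            current.append(t)
            last_t = t
        sessions.append((src, current))
    return sessions

# Toy data: (source_ip, unix_timestamp) per connection.
conns = [("10.0.0.5", 0), ("10.0.0.5", 40), ("10.0.0.5", 2000), ("10.0.0.9", 10)]
for src, times in sessionize(conns):
    print(src, len(times), "accesses")
```

Counting sessions per source (as in Forrest's work) then just means counting the tuples this returns.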
Clustering by network traffic :
-> Frequency works well with a low number of machines.
-> With a high number of machines -> aggregate them into activity groups.
-> Suggest machines with similar functions have similar activity (mmm, seems plausible).
-> Clustering is a stats technique; the most used is k-means.

K-means method :
-> Suggests deciding on a number of clusters and proceeding with k-means.
-> Determine the number of clusters via visualization, guesswork and trial and error (but they don't suggest a definite method for choosing the number of clusters).
-> Basically they count conns over the first 1024 ports (udp/tcp), then apply the k-means method and see where clusters form.
-> Plots a graph of machine vs. port, with the color of each dot representing its probability; probabilities lower than 0.2 are ignored (a really interesting idea).
-> Uses these probability vectors to produce an activity vector for a group, which is then used to classify conns (a toy sketch of this pipeline is at the end of these notes).
-> Issues :
   -> Estimates the structure of the data as spherical "clump"-type shapes.
   -> Fitting issues with large vectors.

ADC method :
-> Rather complicated model that uses distances from an initial vector to the data set, keeping the shortest distances (if I were to use this sort of method I would need to look into a proper text on it).

Results :
-> Port scans were easy to detect even though a scan targets more than one port (however, this is a very limited technique in that it won't help for scans of obscure ports).
-> A threshold is used to avoid flagging normal activity (conns with too-high probability are excluded).
-> ADC detected all attacks and reduced the data by 90%.
-> K-means missed a few attacks and misdiagnosed others.

Conclusion :
-> Advantages of the technique :
   -> Doesn't require a security expert.
   -> Doesn't necessarily require perfectly clean data (as the attack data shouldn't make up a significant portion of the data sent).
-> Disadvantages :
   -> If clusters aren't homogeneous, then normal ports could get lower probabilities than they should have.
   -> Can be very time intensive.
-> More tests are required with these techniques to see if they are useful.
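Toy sketch of the k-means pipeline (referenced from the K-means section above), as I understand it: build one probability vector over the first 1024 ports per machine, run k-means on those vectors, and read off each group's high-probability ports using the paper's 0.2 cutoff. sklearn's KMeans, the choice of k=2, and the toy web/mail machines are my stand-ins for whatever the authors actually used:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
N_PORTS = 1024  # the paper only counts the first 1024 tcp/udp ports

def activity_vector(port_counts):
    """Per-machine probability vector over the first 1024 ports."""
    v = np.zeros(N_PORTS)
    for port, n in port_counts.items():
        if port < N_PORTS:
            v[port] = n
    return v / max(v.sum(), 1)

# Toy machines: web servers (80/443-heavy) vs mail servers (25-heavy).
machines = (
    [{80: rng.integers(50, 100), 443: rng.integers(10, 30)} for _ in range(10)]
  + [{25: rng.integers(50, 100), 110: rng.integers(5, 20)} for _ in range(10)]
)
X = np.array([activity_vector(m) for m in machines])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # the activity group assigned to each machine

# Per the paper's plot: ignore ports whose group probability is below 0.2.
for i, center in enumerate(km.cluster_centers_):
    print("group", i, "characteristic ports:", np.nonzero(center >= 0.2)[0])
```

A conn could then be scored against its machine's group center: a high count on a port the group never uses is the "interesting traffic" the paper wants to surface.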