I was involved in a project that required analyzing and mining information of interest from large network traces (spanning terabytes of data). This was my first time working with network data of this magnitude, so I learned quite a few things by trial and error. I decided to do a quick post to summarize my experience and ‘humble’ insights!
1. Whenever you get a dataset, don’t take the data-provider’s word for it. Do the following:
a): Find out what the time settings were on the machine on which the capture was taken.
Reason: Capture files store timestamps as seconds since the Unix epoch (Jan 1, 1970 00:00:00 UTC). However, any tool you use to view the dump file will render those times according to the local time settings on your machine. Most tools, Wireshark included, can even change the sequence in which packets are displayed, so don’t believe what you see in Wireshark; use your own script to verify timestamps.
b): Calculate the duration of the capture by subtracting the start timestamp from the end timestamp. If the duration makes sense, you are good to go; otherwise, write a script that prints the inter-arrival times between packets and flags the ones that are unacceptably large, then investigate why (see the first sketch after this list).
c): Run a script to count incomplete TCP handshakes. It will give you an idea of what kind of data you have at hand.
d): Run a script to estimate data loss based on TCP sequence and acknowledgement numbers (see the second sketch below).
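For checks a) through c), a minimal sketch along the following lines is one way to start. It assumes the dpkt library and Ethernet-framed IPv4/TCP traffic; the 60-second gap threshold is an arbitrary choice, and the SYN vs. SYN/ACK comparison is only a rough proxy for incomplete handshakes (a more faithful script would track each connection by its 4-tuple and look for the full three-way handshake).

```python
# pcap_sanity.py - rough sanity checks for a capture: raw epoch timestamps,
# total duration, large inter-arrival gaps, and a crude handshake count.
# Assumes the dpkt library and Ethernet-framed IPv4/TCP traffic.
import sys
import dpkt

GAP_THRESHOLD = 60.0  # seconds; anything larger gets flagged (arbitrary choice)

def sanity_check(path):
    first_ts = last_ts = prev_ts = None
    syns = synacks = 0
    with open(path, "rb") as f:
        for ts, buf in dpkt.pcap.Reader(f):        # ts is the raw epoch timestamp
            if first_ts is None:
                first_ts = ts
            if prev_ts is not None and ts - prev_ts > GAP_THRESHOLD:
                print("gap of %.2f s ending at epoch %.6f" % (ts - prev_ts, ts))
            prev_ts = last_ts = ts
            try:
                ip = dpkt.ethernet.Ethernet(buf).data
                tcp = ip.data
            except Exception:
                continue                            # truncated or non-IP frame
            if not isinstance(tcp, dpkt.tcp.TCP):
                continue
            syn = tcp.flags & dpkt.tcp.TH_SYN
            ack = tcp.flags & dpkt.tcp.TH_ACK
            if syn and not ack:
                syns += 1                           # connection attempts
            elif syn and ack:
                synacks += 1                        # attempts that got a reply
    print("start epoch: %.6f   end epoch: %.6f" % (first_ts, last_ts))
    print("duration: %.2f s" % (last_ts - first_ts))
    print("SYN: %d   SYN/ACK: %d" % (syns, synacks))

if __name__ == "__main__":
    sanity_check(sys.argv[1])
```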
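For check d), the same loop can be extended to watch per-flow sequence numbers. The sketch below only flags forward jumps and ignores retransmissions, out-of-order segments and 32-bit sequence wrap-around, so treat its output as a hint about capture loss rather than a measurement.

```python
# seq_gaps.py - crude per-flow sequence-gap detector built on the same dpkt loop;
# forward jumps in the expected sequence number suggest data missing from the capture.
import sys
import dpkt

def seq_gaps(path):
    expected = {}                                   # flow 4-tuple -> next expected seq
    with open(path, "rb") as f:
        for ts, buf in dpkt.pcap.Reader(f):
            try:
                ip = dpkt.ethernet.Ethernet(buf).data
                tcp = ip.data
            except Exception:
                continue
            if not isinstance(tcp, dpkt.tcp.TCP):
                continue
            flow = (ip.src, ip.dst, tcp.sport, tcp.dport)
            payload = len(tcp.data)
            if flow in expected and tcp.seq > expected[flow]:
                print("flow %s: %d bytes apparently missing at epoch %.6f"
                      % (flow, tcp.seq - expected[flow], ts))
            if payload:
                expected[flow] = max(expected.get(flow, 0), tcp.seq + payload)

if __name__ == "__main__":
    seq_gaps(sys.argv[1])
```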
2. Split the data into 1 GB files. I can’t emphasize this enough. Barring the initial overhead involved in splitting the file (it took about 10 hours to split 238 GB), it saves so much time and effort and makes debugging a lot easier. If a script halts, you know where to restart and where to look for the problem. Also, many so-called state-of-the-art tools tend to crash when presented with a large chunk of data.
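In practice, command-line tools such as editcap (bundled with Wireshark) or tcpdump’s -C option can do the splitting for you. A do-it-yourself sketch with dpkt, rolling over to a new output file after roughly 1 GB of packet data, would look something like this:

```python
# split_pcap.py - split one large capture into ~1 GB chunks (a sketch using dpkt;
# dedicated tools are faster, but this shows the idea and keeps timestamps intact).
import sys
import dpkt

CHUNK_BYTES = 1024 ** 3                             # roll over after ~1 GB of packet data

def split(path, prefix):
    out_f = writer = None
    written = part = 0
    with open(path, "rb") as f:
        for ts, buf in dpkt.pcap.Reader(f):
            if writer is None or written >= CHUNK_BYTES:
                if out_f:
                    out_f.close()
                part += 1
                out_f = open("%s-%03d.pcap" % (prefix, part), "wb")
                writer = dpkt.pcap.Writer(out_f)    # plain pcap writer
                written = 0
            writer.writepkt(buf, ts)                # preserve the original timestamp
            written += len(buf)
    if out_f:
        out_f.close()

if __name__ == "__main__":
    split(sys.argv[1], sys.argv[2])                 # e.g. split big.pcap chunks/part
```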
3. You *may* want to further split the data based on inbound and outbound traffic. Alternatively, instead of physically separating inbound and outbound traffic, you can implement filters in your script (I prefer the latter; a small helper is sketched below).
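A direction filter can be as small as a helper like this one (the 192.168.0.0/16 prefix is just a placeholder for whatever address range your capture point actually serves):

```python
# direction.py - tag packets as inbound or outbound inside the parsing loop
# instead of physically splitting the capture into two sets of files.
import ipaddress
import socket

LOCAL_NET = ipaddress.ip_network("192.168.0.0/16")   # placeholder for the monitored network

def is_outbound(ip_pkt):
    """True if the packet's source address lies inside the monitored network."""
    src = ipaddress.ip_address(socket.inet_ntoa(ip_pkt.src))  # dpkt keeps raw 4-byte addresses
    return src in LOCAL_NET
```

Inside the parsing loop from the earlier sketches you would then simply branch on is_outbound(ip) rather than maintaining two physical copies of the trace.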
4. Make provision for statefulness in all your scripts. Print useful information to log files so that even if a script crashes, you can pinpoint the problem and resume where it left off (a checkpointing sketch follows).
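A minimal way to get that statefulness, assuming the trace has already been split into chunks as in point 2, is to log each finished chunk and skip it on restart. Here, process_chunk is just a stand-in for whatever per-chunk analysis you actually run:

```python
# checkpoint.py - skip chunks that have already been processed, and log progress
# so a crash can be diagnosed and resumed from the right place.
import logging
import os

CHECKPOINT = "processed_chunks.txt"
logging.basicConfig(filename="analysis.log", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def already_done():
    if not os.path.exists(CHECKPOINT):
        return set()
    with open(CHECKPOINT) as f:
        return {line.strip() for line in f}

def run(chunks, process_chunk):
    done = already_done()
    for chunk in chunks:
        if chunk in done:
            logging.info("skipping %s (already processed)", chunk)
            continue
        logging.info("starting %s", chunk)
        process_chunk(chunk)                        # the actual per-chunk analysis
        with open(CHECKPOINT, "a") as f:
            f.write(chunk + "\n")
        logging.info("finished %s", chunk)
```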
5. Always test your script on a small chunk of data before unleashing it on the data giant.
6. Python+Scapy is a bad, bad choice for parsing pcap files. Among high-level languages, Java’s jpcap is probably the best bet: a Python script that took over 72 hours to parse 238 GB of data did the same job in 2.30 hours when reimplemented in Java.
7. Enable remote access on your machines. Saves a great deal of time. But be ‘careful’ — don’t turn it into a hacker fiesta!
8. Replicate data and results wherever they can fit. Hard disks will fail, computers will crash and all hell will break loose the moment you decide to do anything worthwhile with your data.
Now that I am familiar with Bro IDS, I intend to do a redo of this post, mentioning things you can use Bro to do for you. Why reinvent the wheel? See you again soon!