Mar
5

Playing with ALOT of network traffic

By Sheharbano  //  Techy Stuff  //  4 Comments

I was involved in a project that required analyzing and mining information of interest from large network traces (spanning terabytes of data). This was my first time with network data of this magnitude, so I learnt quite a few things by trial and error. I decided to do a quick post to summarize my experience and ‘humble’ insights!

1. Whenever you get a dataset, don’t take the data-provider’s word for it. Do the following:
a): Find out the time settings of the machine on which the capture was taken.

Reason: Capture files store timestamps as seconds since the standard epoch (Jan 1, 1970 00:00:00 UTC). However, any tool you use to view the dump file will convert this time according to the local time settings on your machine. In my experience, tools such as Wireshark may even show packets in a different sequence, so don’t believe what you see in Wireshark and use your own script to verify timestamps.
b): Calculate the duration of the capture by subtracting the start timestamp from the end timestamp. If the duration makes sense, you are good to go. Otherwise, write a script that prints the interarrival times between packets and marks the ones that are unacceptably large. You might want to investigate why those gaps exist.
c): Run a script to count incomplete TCP handshakes. It will give you an idea of what kind of data you have at hand.
d): Run a script to estimate data loss based on packet Seq and Ack numbers.
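
Checks (a) and (b) can be sketched in a few lines of plain Python. This is only a minimal sketch: it assumes the classic pcap format with microsecond timestamps (not pcapng), and the function names `read_timestamps` and `find_gaps` as well as the 60-second threshold are my own illustrative choices, not part of any library.

```python
import struct

def read_timestamps(path):
    """Yield each packet's capture time as epoch seconds (float, UTC)."""
    with open(path, "rb") as f:
        header = f.read(24)  # classic pcap global header is 24 bytes
        magic = header[:4]
        if magic == b"\xd4\xc3\xb2\xa1":
            endian = "<"  # file written on a little-endian machine
        elif magic == b"\xa1\xb2\xc3\xd4":
            endian = ">"  # file written on a big-endian machine
        else:
            raise ValueError("not a classic microsecond pcap file")
        while True:
            rec = f.read(16)  # per-packet record header
            if len(rec) < 16:
                break
            ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(endian + "IIII", rec)
            f.seek(incl_len, 1)  # skip the packet bytes, only the time matters here
            yield ts_sec + ts_usec / 1e6

def find_gaps(timestamps, max_gap=60.0):
    """Return (capture duration, list of (before, after) pairs with a large gap)."""
    first = prev = None
    gaps = []
    for ts in timestamps:
        if first is None:
            first = ts
        elif ts - prev > max_gap:
            gaps.append((prev, ts))
        prev = ts
    duration = (prev - first) if first is not None else 0.0
    return duration, gaps
```

To look at a timestamp without your local time zone getting in the way, `datetime.fromtimestamp(ts, tz=timezone.utc)` prints it in UTC.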

2. Split the data into 1 GB files. Can’t emphasize this enough. Barring the initial overhead of splitting the file (it took about 10 hours to split 238 GB), it saves so much time and effort and makes debugging a lot easier. If a script halts, you know where to restart or where to look for the problem. Also, many so-called state-of-the-art tools tend to crash when presented with a large chunk of data.
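
For illustration, here is roughly what such a splitter could look like in plain Python. It is a sketch under assumptions: classic pcap input (not pcapng), and a hypothetical `split_pcap` function and output naming scheme that I made up for this example. It cuts only on packet boundaries and copies the global header into every piece, so each piece is a valid capture file on its own.

```python
import struct

def split_pcap(path, out_prefix, max_bytes=1 << 30):
    """Split a classic pcap file into pieces of at most roughly max_bytes each.

    Returns the list of output file names.
    """
    outputs = []
    with open(path, "rb") as src:
        global_header = src.read(24)
        magic = global_header[:4]
        # Accept both microsecond and nanosecond variants of the magic number.
        if magic in (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1"):
            endian = "<"
        elif magic in (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d"):
            endian = ">"
        else:
            raise ValueError("not a classic pcap file")
        out = None
        written = 0
        while True:
            rec = src.read(16)  # per-packet record header
            if len(rec) < 16:
                break
            incl_len = struct.unpack(endian + "IIII", rec)[2]
            data = src.read(incl_len)
            if len(data) < incl_len:
                break  # truncated final packet, drop it
            # Start a new piece when this packet would push us past the limit.
            if out is None or written + 16 + incl_len > max_bytes:
                if out is not None:
                    out.close()
                name = f"{out_prefix}_{len(outputs):05d}.pcap"
                out = open(name, "wb")
                out.write(global_header)
                written = 24
                outputs.append(name)
            out.write(rec)
            out.write(data)
            written += 16 + incl_len
        if out is not None:
            out.close()
    return outputs
```

A call such as `split_pcap("trace.pcap", "chunk")` would produce `chunk_00000.pcap`, `chunk_00001.pcap` and so on, where `trace.pcap` is just a placeholder name.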

3. You *may* want to further split the data based on inbound and outbound traffic. Alternatively, instead of physically separating inbound and outbound traffic, you can implement filters in your script (I prefer this).
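
A direction filter of this kind might look like the sketch below. It assumes you already know which address ranges count as "local" for the capture point; the helper name `make_direction_classifier` and the example prefix are illustrative assumptions, and the addresses are expected as strings that your own parser has already extracted from the packets.

```python
import ipaddress

def make_direction_classifier(local_cidrs):
    """Build a function that labels a (src, dst) address pair by direction."""
    nets = [ipaddress.ip_network(c) for c in local_cidrs]

    def is_local(addr):
        ip = ipaddress.ip_address(addr)
        return any(ip in net for net in nets)

    def classify(src, dst):
        src_local, dst_local = is_local(src), is_local(dst)
        if src_local and not dst_local:
            return "outbound"
        if dst_local and not src_local:
            return "inbound"
        if src_local and dst_local:
            return "internal"
        return "external"  # neither end is in the local ranges

    return classify

# Hypothetical usage: treat 10.0.0.0/8 as the monitored network.
classify = make_direction_classifier(["10.0.0.0/8"])
```

With this in place, a script can skip or keep packets with a simple check such as `if classify(src, dst) == "inbound":` instead of maintaining two copies of the data.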

4. Make provision for statefulness in all your scripts. Print out useful information in log files so that even if the script crashes, you can pinpoint the problem.
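
One simple way to get this kind of statefulness, assuming the data has already been split into files, is to record each finished file in a small progress file and skip it on the next run. The sketch below is only one possible approach; `process_files`, the `handle` callback, and the `progress.json` name are hypothetical names chosen for the example.

```python
import json
import logging
import os

def process_files(paths, handle, state_path="progress.json"):
    """Run handle(path) on each file, skipping files finished in earlier runs."""
    done = set()
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = set(json.load(f))
    for path in paths:
        if path in done:
            logging.info("skipping %s (already processed)", path)
            continue
        logging.info("starting %s", path)
        handle(path)  # if this raises, the log shows which file was in progress
        done.add(path)
        # Write to a temporary file first so a crash cannot corrupt the state.
        tmp_path = state_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(sorted(done), f)
        os.replace(tmp_path, state_path)
        logging.info("finished %s", path)
    return done
```

If the script dies on file 57, rerunning it with the same arguments picks up at file 57 rather than starting over.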

5. Always test your script on a small chunk of data before unleashing it on the data giant.

6. Python+Scapy is a bad, bad choice for parsing pcap files. Among high-level languages, Java’s jpcap is probably the best bet. A Python script took more than 72 hours to parse 238 GB of data; reimplemented in Java, the same job took 2.30 hours.

7. Enable remote access on your machines. Saves a great deal of time. But be ‘careful’ — don’t turn it into a hacker fiesta!

8. Replicate data and results wherever they can fit. Hard disks will fail, computers will crash and all hell will break loose the moment you decide to do anything worthwhile with your data.

Now that I am familiar with Bro IDS, I intend to do a follow-up to this post mentioning things you can use Bro to do for you. Why reinvent the wheel? See you again, soon.

4 Comments to “Playing with ALOT of network traffic”

  • Get in touch if you’d like to talk about what you are ultimately trying to accomplish. One thing Bro is very good at is looking at *extremely* large amounts of traffic.

  • Very useful insights. I agree with most of them except, as you know, the Scapy one. Jpcap’s equivalent in Python is Pcapy/Impacket, not Scapy. Scapy is a much more powerful framework and that comes at a cost. Comparing Jpcap to Scapy is a bit like complaining about Photoshop’s loading time being longer than MS Paint’s.

    With that said, even Pcapy might be painfully slow (haven’t run it on our traces yet). But at least we won’t be comparing apples to oranges if we make such a comparison.

    • I completely agree with you. This post was added for the sake of ‘posterity’. I haven’t tried Pcapy but did have a tryst with dpkt. dpkt is fast. Scapy is slow because it makes a ‘class’ for every packet. This makes packet parsing very easy but at the cost of speed. So Scapy makes sense for less data, but when the volume of data is high, perhaps one should come down to a lower level and settle for the dpkts and pcapys of the world. Of course, C is the ultimate choice when speed is the primary concern.

  • [...] I thought of writing a splitter in Python but my colleague’s aversion for using Python on large network traces coupled with lack of maintenance of libpcap bindings resulted in me going for C/libpcap directly. [...]
