Ubuntu – Is internet upload so high when I don’t actually upload much

bandwidthfirefoxgoogle-chromemonitoringnetworking

Total Monthly Internet Bandwidth

I use vnstat to monitor internet usage:

$ vnstat

                      rx      /      tx      /     total    /   estimated
 eth0:
       Jul '17    210.70 GiB  /   51.00 GiB  /  261.71 GiB
       Aug '17    275.79 GiB  /   70.54 GiB  /  346.33 GiB  /  348.91 GiB
     yesterday      5.47 GiB  /    2.08 GiB  /    7.55 GiB
         today      2.89 GiB  /    1.36 GiB  /    4.26 GiB  /    5.52 GiB

 wlan0:
       Jul '17         0 KiB  /       0 KiB  /       0 KiB
       Aug '17         0 KiB  /       0 KiB  /       0 KiB  /       0 KiB
     yesterday         0 KiB  /       0 KiB  /       0 KiB
         today         0 KiB  /       0 KiB  /       0 KiB  /      --    

I switched ISPs 6 months ago and the new ISP is picky on total monthly usage causing me to pay closer attention to stats.

Real Time Internet Usage

I checked the monitoring options in Ask Ubuntu and the answers point to nethogs which only reports KB/Sec by process which is inevitably Firefox or Chrome both reported in KB/Sec:

nethogs

This isn't helpful because I already know I use Chrome and Firefox. The question is "which tab?" or is it even a tab? Notice there are processes running as root? I never use sudo with Chrome or Firefox.

Investigative 5W's of Data Uploads

There are 5 W's:

  • Who is uploading 70 GB of data from my laptop each month? I backup daily to gmail.com which is 5.4 MB of scripts, documents, configuration settings and what not. That's 150 MB per month. Who's grabbing the other 69 GB?
  • What program is grabbing this data? I can't use a single process ID for Chrome or Firefox as an answer. I need to know the tab which points to the website. I can't use root and some random IP address as an answer.
  • Where is this data going? ie IP address.
  • When is this happening? Is it when I'm watching a movie? Watching internet news in Al-Jazeera or RT? Some sort of notification bubble on up-load volume would be nice.
  • Why? I don't need an answer to that question. The other 4 W's will suffice. It might be Vault 7 or it might not be. You can't sue the CIA and if you can't beat`em you should just block them.

Daily Internet habits

There are only six things I do daily on the internet:

  • Visit Ask Ubuntu and read Q&As. Uploads should be <1 MB/day because any answer I post is < 30 KB or update.
  • Watch Al-Jazeera.com live TV which uses HTML5 on youtube.com
  • Watch rt.com/on-the-air which uses Flash Player
  • Daily backup my scripts, documents and configuration files via e-mail to my gmail.com account and the .tar file is 5.4 MB.
  • Watch a movie on random websites in 1080p resolution when lucky, else 480p or 720p when not lucky.
  • Google search and visit websites to research Linux/Ubuntu related technical issues.

Summary

I'm familiar with Shift + Esc in Chrome to monitor network statistics in real time by Chrome Tab but something that runs in the background collecting statistics is preferable.

I haven't run Windows 8.1 in well over a month so the uploads aren't happening there. It's all in Linux/Ubuntu.

What can I do to narrow down my searching for the massive uploads?

Thank you for reading this far.

Best Answer

Note: This answer only addresses some of the desired "Investigative 5W's of Data Uploads".

Use tcpdump to capture all packet traffic, and use some post-processing to extract desired information.

sudo tcpdump -i enp4s0 -w 'ext-%F-%H-%M-%S.bin' -G 3600 -z /home/doug/bin/packet_post_processor2

Where:
my WAN facing interface is enp4s0;
The file names automatically include the date and time (requires an additional package, but I can not recall which);
I'm asking for file rotation once per hour;
Each file be post processed by the packet_post_processor script (2 is for this answer).

The post-processing script:

#!/bin/dash
#
# packet_post_processor2 Doug Smythies. 2017.09.08
#    Edits as required for updated c prgram, and bad sort order.
#    There may be little use in sort by packets count, but for now
#    it remians.
#
# packet_post_processor2 Doug Smythies. 2017.09.01
#    This script will be called from the always running tcpdump.
#    It is called for every binary file rotation.
#    The purpose is to make summary files of things that one
#    may want to investigate in more detail later on.
#
#    This version is for WinEunuuchs2Unix and
# https://askubuntu.com/questions/951783/how-to-find-out-who-is-taking-70-gb-of-data-from-me-each-month
#

#check that the call included the file name, and only the file name, to use.
if [ $# -ne 1 ]
then
  echo "Usage - $0  file-name"
  exit 1
fi

# check that the file actually exists:
if [ ! -f $1 ]
then
  echo "tcpdump binary file $1 does not exist, aborting..."
  exit 1
fi

echo "data extraction 1: All the packets..."
# Note: Using the -e option will ease subsequent bytes per unit time calculations
sudo tcpdump -n -tttt -e -r $1 >all_e.txt

echo "data extraction 2: The outgoing normal packets..."
# Note: We might want to check that something important doesn't get missed here.
# Note: replace the fake IP address with your actual IP address.
grep ": XXX\.XXX\.XXX\.XXX\." all_e.txt | grep Flags >outgoing.txt

echo "data extraction 3: Make a histogram of the destination IP addresses by packets..."
# Note: use field 13
cut -d" " -f13 outgoing.txt | sed 's/.[^.]*$//' | sort | uniq -c | sort -g >outhisto.txt

# Phase 2: Maximum packet count might not mean maximum byte count, so figure out maximum byte count

echo "data extraction 4: Sort the outgoing file by destination IP address."
LC_ALL=C sort -k 13 <outgoing.txt >outgoing.srt

echo "data extraction 5: Now, calculate bytes per IP and bytes per IP/16 and make sorted historgrams"
# Note: There might be some clever awk or whatever way to do this, but I have a c program.
./tcpdump_bytes outgoing.srt outb.txt out16.txt
sort --general-numeric-sort <outb.txt >outhistob.txt
sort --general-numeric-sort <out16.txt >outhistob16.txt

#Leave the intermidiate files, just for now, while we debug.
#
# packet_post_process. End.

The c program called from within the script:

    /*****************************************************************************
*
* tcpdump_bytes.c 2017.09.08 Smythies
*       By sorting the input file before running this program, it can do bytes
*       per IP all on its own, and in one pass through the file. At this time,
*       it is for outgoing only. A future revision will add command line
*       options for incoming and such.
*       Might as well group by 1st 2 IP address bytes at the same time,
*       i.e. for some (not all) of those multiple IP situations.
*
* tcpdump_bytes.c 2017.09.01 Smythies
*       Count the bytes for all the packets in the passed file.
*       See also tcpdump_extract.c, from which this was taken.
*       This program is very quite, just printing bytes, unless there
*       is some error. The idea is that is part of something bigger and
*       therefore extra verbosity would just get in the way.
*
*       Note: The input tcpdump file needs to have been done
*             with the -e option.
*
*****************************************************************************/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LENGTH 2000  /* maximum line length */

void main(int argc, char **argv){

   char in_buffer[MAX_LENGTH];
   char *infile, *outfile1, *outfile2;
   char *index, *index2;
   FILE *inf, *out1, *out2;
   unsigned current_bytes, sip3, sip2, sip1, sip0, sport, dip3, dip2, dip1, dip0, dport;
   unsigned dest_ip, dest_ip_16, dest_ip_old, dest_ip_16_old;
   unsigned num_lines, num_ips, num_16s;
   unsigned long long total_bytes, total_bytes_16;

   switch(argc){
   case 4:
      infile = argv[1];
      outfile1 = argv[2];
      outfile2 = argv[3];
      break;
   default:
      printf("tcpdump_bytes infile outfile1 outfile2\n");
      printf("  parse outgoing bytes per IP out of a sorted tcpdump file where the -e option was used.\n");
      printf("  infile is sorted tcpdump output file; oufile1 is bytes per IP; outfile 2 is bytes per IP/16.\n");
      exit(-1);
   } /* endcase */

   if((inf = fopen(infile, "rt")) == NULL){
      printf("Unable to open input file '%s'\n", infile);
      exit(-1);
   } /* endif */
   if((out1 = fopen(outfile1, "wt")) == NULL){
      printf("Error opening output file '%s'\n", outfile1);
      exit(-1);
   } /* endif */
   if((out2 = fopen(outfile2, "wt")) == NULL){
      printf("Error opening output file '%s'\n", outfile2);
      exit(-1);
   } /* endif */

   total_bytes = 0;
   total_bytes_16 = 0;
   dest_ip_old = 0;
   dest_ip_16_old = 0;
   num_lines = 0;
   num_ips = 0;
   num_16s = 0;

   while((fgets(in_buffer, MAX_LENGTH, inf)) != NULL){       /* do infile line at a time */
      num_lines++;

      if((index = strstr(in_buffer, "), length ")) != NULL){ /* find search string if it is there, then parse the data */
         sscanf(index, "), length %u: %u.%u.%u.%u.%u > %u.%u.%u.%u.%u:",
            &current_bytes,
            &sip3, &sip2, &sip1, &sip0,
            &sport,
            &dip3, &dip2, &dip1, &dip0,
            &dport);
      } else {
         printf("tcpdump_bytes: Got an odd line: %s", in_buffer);
      } /* endif */
      dest_ip_16 = (dip3 << 24) + (dip2 << 16);
      dest_ip = dest_ip_16 + (dip1 << 8) + dip0;
//    printf("debug: B: %u  S: %u.%u.%u.%u.%u  D: %u.%u.%u.%u.%u  %u  %u\n", current_bytes, sip3, sip2, sip1, sip0, sport, dip3, dip2, dip1, dip0, dport, dest_ip, dest_ip_16);

      if(dest_ip != dest_ip_old){
         if(total_bytes != 0){
            fprintf(out1, "%llu %u.%u.%u.%u\n", total_bytes, (dest_ip_old >> 24) & 0xff, (dest_ip_old >> 16) & 0xff, (dest_ip_old >> 8) & 0xff, dest_ip_old & 0xff);
            total_bytes = 0;
         } /* endif */
         dest_ip_old = dest_ip;
         num_ips++;
      } /* endif */
      total_bytes = total_bytes + (unsigned long long) current_bytes;

      if(dest_ip_16 != dest_ip_16_old){
         if(total_bytes_16 != 0){
            fprintf(out2, "%llu %u.%u.0.0/16\n", total_bytes_16, (dest_ip_16_old >> 24) & 0xff, (dest_ip_16_old >> 16) & 0xff);
            total_bytes_16 = 0;
         } /* endif */
         dest_ip_16_old = dest_ip_16;
         num_16s++;
      } /* endif */
      total_bytes_16 = total_bytes_16 + (unsigned long long) current_bytes;
   } /* endwhile */

   /* don't forget to output the last data */
   if(total_bytes != 0){
      fprintf(out1, "%llu %u.%u.%u.%u\n", total_bytes, dip3, dip2, dip1, dip0);
   } else {
      printf("tcpdump_bytes: Something is wrong. Last IP address has no bytes.\n");
   } /* endif */

   if(total_bytes_16 != 0){
      fprintf(out2, "%llu %u.%u.0.0/16\n", total_bytes_16, dip3, dip2);
   } else {
      printf("tcpdump_bytes: Something is wrong. Last IP/16 address has no bytes.\n");
   } /* endif */

   fclose(inf);
   fclose(out1);
   fclose(out2);
   printf("tcpdump_bytes: Done. Processed %d lines and %d IP addresses and %d /16 addresses\n", num_lines, num_ips, num_16s);
} /* endprogram */

Note that some files will get clobbered with the next hours processing. I'll fix that later.

A quick summary of what the post-processing script is doing:
First, the binary tcpdump file is converted to per packet summaries text. Example (my address has been changed to XXX.XXX.XXX.XXX):

2017-05-31 08:10:31.721956 00:22:b0:75:c2:bd > 6c:be:e9:a7:f1:07, ethertype IPv4 (0x0800), length 400: XXX.XXX.XXX.XXX.52779 > 38.113.165.77.443: Flags [P.], seq 1:347, ack 1, win 256, length 346
2017-05-31 08:10:31.826241 6c:be:e9:a7:f1:07 > 00:22:b0:75:c2:bd, ethertype IPv4 (0x0800), length 157: 38.113.165.77.443 > XXX.XXX.XXX.XXX.52779: Flags [P.], seq 1:104, ack 347, win 1026, length 103
2017-05-31 08:10:31.877945 00:22:b0:75:c2:bd > 6c:be:e9:a7:f1:07, ethertype IPv4 (0x0800), length 54: XXX.XXX.XXX.XXX.52779 > 38.113.165.77.443: Flags [.], ack 104, win 256, length 0
2017-05-31 08:10:32.603768 00:22:b0:75:c2:bd > 6c:be:e9:a7:f1:07, ethertype ARP (0x0806), length 42: Request who-has XXX.XXX.XXX.YYY tell XXX.XXX.XXX.XXX, length 28
2017-05-31 08:10:32.630960 6c:be:e9:a7:f1:07 > 00:22:b0:75:c2:bd, ethertype ARP (0x0806), length 60: Reply XXX.XXX.XXX.YYY is-at 6c:be:e9:a7:f1:07, length 46
2017-05-31 08:10:33.643468 00:90:d0:63:ff:00 > 01:00:5e:00:00:01, ethertype IPv4 (0x0800), length 60: 10.197.248.13 > 224.0.0.1: igmp query v2
2017-05-31 08:10:37.448732 00:22:b0:75:c2:bd > 6c:be:e9:a7:f1:07, ethertype IPv4 (0x0800), length 90: XXX.XXX.XXX.XXX.53120 > 91.189.89.199.123: NTPv4, Client, length 48 

It is on purpose that an ARP packet pair is included in the example, so show something that would be excluded from further processing.
The annoying IGMP packet from a private LAN IP is from my ISP and will also be excluded from further processing. However, if my ISP ever claims I have gone over my monthly data limit, I'll point to such packets when I say what I will not pay for. Notice two lengths shown on each line, the first one is bytes on the wire and the second one is payload length. We want bytes on the wire, and this is why we use the -e option with tcpdump.

Second, outgoing packet can uniquely be identified by finding ": XXX.XXX.XXX.XXX.", so extract all outgoing packets, not including ARP and ICMP, using grep.

Third, using space as a delimiter, field 13 is the destination IP address, so use a complicated bunch of piped commands to extract, count, and sort the destination IP address packets.

Forth, sort the outgoing packets by destination IP address.
Fifth, use the c program to calculate bytes per IP and bytes per IP/16 and sort the output into histograms.

Sixth, manually investigate the top IP addresses in an attempt to identify what is going on. Note that very often one can find the related forward lookup DNS inquiry in the tcpdump output.

As an example, I looked at my WAN/LAN data between 2017-05-31 08:09:33 and 2017-08-09 22:13:11 and edited in what I found for the various IP addresses.

First the top few by packet count:

packets IP Address      Added Comment
 299517 91.189.95.84    Ubuntu stuff
 301129 198.38.112.140  Netflix
 306815 17.253.31.206   Apple stuff
 319558 129.97.134.71   Ubuntu stuff (mirror, I think)
 333334 91.189.88.152   Ubuntu stuff
 352141 91.189.88.39    Ubuntu stuff
 353160 209.121.139.153 Telus (Microsoft updates streaming)
 368669 209.121.139.163 Telus (Microsoft updates streaming)
 389928 91.189.88.161   Ubuntu stuff
 396087 23.60.74.158    deploy.static.akamaitechnologies.com (?)
 421259 198.38.112.170  Netflix
 474506 17.253.31.205   Apple stuff
 477706 198.38.109.153  Netflix
 480452 198.38.112.159  Netflix
 540261 198.38.112.173  Netflix
 574592 198.38.112.132  Netflix
 710022 198.38.112.174  Netflix
 728434 209.121.139.144 Telus (Microsoft updates streaming)
 738839 198.38.112.130  Netflix
 883688 198.38.109.171  Netflix
1049778 198.38.112.154  Netflix
2166582 72.21.81.200    Hmmmm ? MCI Communications Services, (Skype, I think)
7512548 13.107.4.50     Microsoft (updates)

Second, the top few by byte count:

Bytes    IP                     Added Comment
32358580 17.253.31.205          Apple stuff
32625068 198.38.112.159         Netflix
34220805 172.217.3.206          Google web crawler
36628021 198.38.112.173         Netflix
37022702 17.188.208.132         Apple stuff
39105254 198.38.112.132         Netflix
40697177 209.121.139.144        Telus Microsoft updates file streaming
48247623 198.38.112.174         Netflix
49537980 64.4.54.254            Microsoft
50358753 198.38.112.130         Netflix
59623846 198.38.109.171         Netflix
71532166 198.38.112.154         Netflix
98480036 207.167.198.18         Telus e-mail stuff
139907010 72.21.81.200          Hmmmm ? MCI Communications Services, (Skype, I think)
210138801 91.189.95.84          Ubuntu stuff
325511064 204.79.197.213        Microsoft (?) msedge.net storage.skyprod.akadns.net
479586878 13.107.4.50           Microsoft (updates)

Notice how, since Netflix, for example, uses many IP addresses, it might fall lower in the ranking than it really should be, if all its IP addresses were treated as one.

Third, the top few /16 groups by bytes count. Notice how Netflix is now the biggest:

107592753 209.52.0.0/16         cache.google.com (for example)
116538884 207.167.0.0/16        Telus e-mail stuff
120769715 17.188.0.0/16         Apple. store-025-failover2.blobstore-apple.com.akadns.net (for example)
139261655 52.218.0.0/16         s3-us-west-2.amazonaws.com (for example) ? Hmmm...
147091123 172.217.0.0/16        Google web crawler
153146532 17.248.0.0/16         p46-keyvalueservice.fe.apple-dns.net. Apple iCloud Drive
183300509 72.21.0.0/16          Skype (I think)
213119564 209.121.0.0/16        Telus Microsoft updates file streaming
333374588 204.79.0.0/16         Microsoft
354346088 91.189.0.0/16         Ubuntu stuff
488793579 13.107.0.0/16         Microsoft (updates)
621733032 198.38.0.0/16         Netflix
Related Question