Ubuntu – Python script: How to chop the output to a limited line size

Tags: command line, python, scripts, text processing

I am using a Python script to separate the domain from each email address and then group the emails by domain. The following script works for me:

#!/usr/bin/env python3
from operator import itemgetter
from itertools import groupby
import os
import sys

dr = sys.argv[1]


for f in os.listdir(dr):
    write = []
    file = os.path.join(dr, f)
    lines = [[l.strip(), l.split("@")[-1].strip()] for l in open(file).readlines()]
    lines.sort(key=itemgetter(1))
    for item, occurrence in groupby(lines, itemgetter(1)):
        func = [s[0] for s in list(occurrence)]
        write.append(item+","+",".join(func))
    open(os.path.join(dr, "grouped_"+f), "wt").write("\n".join(write))

I ran it with: python3 script.py /path/to/input files
The input was a list of emails, and the output looked like this:

domain1.com,email1@domain1.com,email2@domain1.com
domain2.com,email1@domain2.com,email2@domain2.com,email3@domain2.com

The problem I am facing comes from a MongoDB limit. MongoDB limits a document to 16 MB, and each line in my output file is treated as one document, so a single line must not exceed 16 MB.
What I want is for the result to be limited to 21 emails per line. If a domain has more emails, the rest should be printed on a new line that starts with the same domain name (and again on another new line if the remainder exceeds 21). I can store duplicate data in MongoDB.

So the final output should be something like the following:

domain1.com,email1@domain1.com,email2@domain1.com,... email21@domain1.com
domain1.com,email22@domain1.com,.....
domain2.com,email1@domain2.com,....

The dots (...) in the above example stand for text that I left out to keep the example easy to read.
I hope this clarifies my problem, and I am hoping to get a solution for it.

Best Answer

New version

The script you posted does group the emails by domain, but with no limit on the number of emails per line. Below is a version that groups emails by domain and then splits each domain's list into chunks of a size you choose. Each chunk is written to its own line, starting with the corresponding domain.

The script

#!/usr/bin/env python3
from operator import itemgetter
from itertools import groupby, islice
import os
import sys

dr = sys.argv[1]
size = 3

def chunk(it, size):
    # return an iterator of tuples, each holding at most `size` items
    it = iter(it)
    return iter(lambda: tuple(islice(it, size)), ())

for f in os.listdir(dr):
    # process each file in the directory
    file = os.path.join(dr, f)
    # create a list of email addresses and domains, sort by domain
    with open(file) as src:
        lines = [[l.strip(), l.split("@")[-1].strip()] for l in src]
    lines.sort(key=itemgetter(1))
    with open(os.path.join(dr, "chunked_"+f), "wt") as report:
        # group by domain, split into chunks
        for domain, occurrence in groupby(lines, itemgetter(1)):
            # write one line per chunk, each starting with the domain
            for a in chunk([s[0] for s in occurrence], size):
                report.write(domain+","+",".join(a)+"\n")
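
To see what the chunk helper does on its own, here is a minimal sketch that runs it on a made-up list of addresses (the addresses are placeholders, not taken from the question). It relies on the two-argument form of iter(), which keeps calling the lambda until it returns the sentinel, here an empty tuple.

```python
from itertools import islice

def chunk(it, size):
    # Make one shared iterator so every islice call resumes where the last one stopped
    it = iter(it)
    # Keep calling the lambda until it returns the sentinel (), i.e. the input is exhausted
    return iter(lambda: tuple(islice(it, size)), ())

# Placeholder addresses, for illustration only
addresses = ["a@x.org", "b@x.org", "c@x.org", "d@x.org", "e@x.org", "f@x.org", "g@x.org"]
chunks = list(chunk(addresses, 3))
print(chunks)
# [('a@x.org', 'b@x.org', 'c@x.org'), ('d@x.org', 'e@x.org', 'f@x.org'), ('g@x.org',)]
```

The last chunk simply holds whatever is left over, so no address is dropped when the total is not a multiple of the chunk size.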

To use

  • Copy the script into an empty file, save it as chunked_list.py
  • In the head section, set the chunk size:

    size = 5
    
  • Run the script with the directory as argument:

    python3 /path/to/chunked_list.py /path/to/files
    

    It will then create a processed copy of each file, named chunked_filename, containing the grouped emails split into chunks.
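
If you would rather not edit the script each time, the chunk size could also be read from the command line. The following is only a sketch of that idea, not part of the script above: the --size option, the parse_args helper and the default of 21 (the limit mentioned in the question) are all assumptions made for illustration.

```python
import argparse

def parse_args(argv=None):
    # Hypothetical command-line interface: a directory plus an optional chunk size
    parser = argparse.ArgumentParser(description="Group emails by domain, in chunks")
    parser.add_argument("directory", help="directory holding the email list files")
    parser.add_argument("--size", type=int, default=21,
                        help="maximum number of emails per output line")
    args = parser.parse_args(argv)
    if args.size < 1:
        # A chunk size below 1 would produce no output lines at all
        parser.error("--size must be at least 1")
    return args

# Example call with an explicit argument list; pass None to read sys.argv instead
args = parse_args(["/path/to/files", "--size", "21"])
print(args.directory, args.size)
```

In the script, dr and size would then be taken from args.directory and args.size instead of being set by hand.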

What it does

The script takes as input a directory with files like:

email1@domain1
email2@domain1
email3@domain2
email4@domain1
email5@domain1
email6@domain2
email7@domain1
email8@domain2
email9@domain1
email10@domain2
email11@domain1

For each file, it creates a chunked copy, like:

domain1,email1@domain1,email2@domain1,email4@domain1
domain1,email5@domain1,email7@domain1,email9@domain1
domain1,email11@domain1
domain2,email3@domain2,email6@domain2,email8@domain2
domain2,email10@domain2

(with size = 3)
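
The example above can be reproduced in memory, without touching any files. This is a sketch under the assumption that the grouping logic is wrapped in a function; the name group_and_chunk is made up for this illustration and does not appear in the script.

```python
from itertools import groupby, islice
from operator import itemgetter

def chunk(it, size):
    # Same helper as in the script: tuples of at most `size` items
    it = iter(it)
    return iter(lambda: tuple(islice(it, size)), ())

def group_and_chunk(addresses, size):
    # Hypothetical wrapper around the script's logic, returning the output lines
    pairs = [(a.strip(), a.split("@")[-1].strip()) for a in addresses]
    # list.sort is stable, so addresses keep their input order within a domain
    pairs.sort(key=itemgetter(1))
    out = []
    for domain, group in groupby(pairs, itemgetter(1)):
        for part in chunk([p[0] for p in group], size):
            out.append(domain + "," + ",".join(part))
    return out

sample = [
    "email1@domain1", "email2@domain1", "email3@domain2", "email4@domain1",
    "email5@domain1", "email6@domain2", "email7@domain1", "email8@domain2",
    "email9@domain1", "email10@domain2", "email11@domain1",
]
result = group_and_chunk(sample, 3)
print("\n".join(result))
# domain1,email1@domain1,email2@domain1,email4@domain1
# domain1,email5@domain1,email7@domain1,email9@domain1
# domain1,email11@domain1
# domain2,email3@domain2,email6@domain2,email8@domain2
# domain2,email10@domain2
```

Calling group_and_chunk(sample, 21) instead would give the 21-per-line limit asked for in the question.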