Mysql – Problems splitting and aggregating data in SQL

aggregateMySQLstring-splitting

Hello I have the following data in csv file:

OrderID, OrderType, Amount, Price, Time
      1,       Buy,     10,     1, 10/01/2018 10:30 AM
      2,       Buy,      2,     1, 10/01/2018 11:30 AM
      3,      Sell,      5,     3, 10/01/2018 01:30 PM
      4,      Sell,      3,     6, 11/01/2018 10:30 AM
      5,       Buy,     13,     2, 11/01/2018 11:30 AM
      6,      Sell,      3,     3, 12/01/2018 10:30 AM
      7,      Sell,     11,     2, 13/01/2018 10:30 AM

What I need to do is I need to aggregate the records, so that I have matching amounts for buy and sell.

So I need to split the buy or the sell in order to have matching amounts.
like this:

ver1: (do nto modify buy records, just sell)

for OrderID=1 I have 10 buy but OrderID=3 (first sell) has just 5.
So what I have to do is search next sell (ID 4) 3, now the sum is still under 10, so I go to the next. If the next is taken at full I will have sum over 10, so I need to split OrderID=6 into two lines, one with the remaining to the 10 (2) and the second with the remaining (3-2=1). Then I look into OrderID=2. Amount is 2, from OrderID=6 I have 1 remaining that I need to use + 1 from (ID 7).

ver2: (keep sell orders and split buy)

In this case OrderID=1 has to be split into 5,3,2 (to be used in OrderID= {3,4,6})
Then OrderID=2 has to be split into two for (OrderID={6,7})
And OrderID=5 is ok to be used in full,s o no modification needed…

Is there a way to do all this in normal SQL or do I need some external program to manipulate the data accordingly?

My data is coming in csv, but I can import it in MySQL.
The program that I use to read the csv with sql is QlikView, but I am looking for general solution for general SQL or MySQL.

Best Answer

As I mentioned in comments on your original question, one option is to leave the original records alone, and have a separate table that you use to tie "buy"s to matching "sell"s.

CREATE TABLE reconcile
     ( id INT AUTO_INCREMENT
      ,buy_order_id INT
      ,sell_order_id INT
      ,amount INT);

Then, create a stored procedure something like this:

CREATE PROCEDURE reconcile_buy_sell()
BEGIN
  DECLARE c_buy CURSOR FOR
  SELECT t.OrderID, (t.`Amount` - rec.rec_amount) as `Amount`
    FROM MyTable t
           INNER JOIN (SELECT buy_order_id, SUM(`amount`) as rec_amount
                         FROM reconcile
                        GROUP BY buy_order_id
                      ) rec ON (t.OrderId = rec.buy_order_id)
   WHERE t.OrderType = 'Buy'
     AND t.`Amount` > rec.rec_amount
   ORDER BY t.`Time`
  ;

  DECLARE c_sell CURSOR FOR
  SELECT t.OrderID, (t.`Amount` - rec.rec_amount) as `Amount`
    FROM MyTable t
           INNER JOIN (SELECT sell_order_id, SUM(`amount`) as rec_amount
                         FROM reconcile
                        GROUP BY sell_order_id
                      ) rec ON (t.OrderId = rec.sell_order_id)
   WHERE t.OrderType = 'Sell'
     AND t.`Amount` > rec.rec_amount
   ORDER BY t.`Time`
  ;

  DECLARE buy_id, sell_id, buy_amount, sell_amount INT;
  DECLARE done INT DEFAULT FALSE;

  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;

  OPEN c_buy;
  OPEN c_sell;

  FETCH c_buy INTO buy_id, buy_amount;
  FETCH c_sell INTO sell_id, sell_amount;

  WHILE NOT done
  DO
    IF buy_amount < sell_amount THEN
      INSERT INTO reconcile (buy_order_id, sell_order_id, `amount`)
      VALUES (buy_id, sell_id, buy_amount);
      SET sell_amount = sell_amount - buy_amount;
      SET buy_amount = 0;
    ELSE
      INSERT INTO reconcile (buy_order_id, sell_order_id, `amount`)
      VALUES (buy_id, sell_id, sell_amount);
      SET buy_amount = buy_amount - sell_amount;
      SET sell_amount = 0;
    END IF;

    IF buy_amount = 0 THEN
      FETCH c_buy INTO buy_id, buy_amount;
    END IF;

    IF sell_amount = 0 THEN
      FETCH c_sell INTO sell_id, sell_amount;
    END IF;
  END WHILE;

  CLOSE c_buy;
  CLOSE c_sell;
END;

[Note: code is untested, and I'm not a MySQL expert - treat this as pseudocode, and feel free to correct any errors found.]

The two cursors will iterate through the "buy"s and "sell"s in the order in which they were entered. They only return orders where the full amount of the "buy" or "sell" has not already been reconciled in the reconcile table, and only return the unreconciled portion of the amount for those rows.

In the WHILE loop, we add rows to reconcile. If the amount of the current "buy" is less that the amount of the current "sell", then the full (remaining) amount for the buy goes into thereconcile` row; otherwise (whether the "sell" amount is less that or equal to the "buy" amount), we can use the amount for the sell.

We reduce the remaining amount for the current "sell" (or "buy", whichever is more) by the remaining amount for the "buy" (or "sell") we just used up, and set the "buy" (or "sell") amount to 0. If the amounts are the same, they'll both be 0 now.

Then, for whichever amounts are zero, we pull in the next record from the cursor.

When either cursor is out of records (including the degenerate case, where one or the other cursor returned no records), we stop reconciling.

NOTE: For performance purposes, it may be worth adding a reconciled bit column to MyTable (or whatever you actually call the table your CSV data gets imported to), and marking rows where the full amount has been reconciled as reconciled = TRUE. This could speed up locating the rows that aren't reconciled.

Related Solutions

MySQL – Efficiently Cloning Tens of Millions of Database Records

Simple approaches:

If your tables are indexed, drop the indexes and put them back after you load the data. It's generally going to be faster for MySQL to build the index on a table in one pass than to continually index the incoming data as you add it.
Run more than one copy of your script, each with a subset of the data source files, doing inserts in parallel
Enable compression if you're using InnoDB: http://dev.mysql.com/doc/refman/5.6/en/innodb-compression.html Typically we think of compression as making things smaller with a trade-off in speed... but my experience with InnoDB compression is that it can actually perform faster when you're dealing with a server that is blocking on I/O but has spare CPU capacity, because you can trade some of one for some of the other -- the time saved by not having to do disk read/write improves performance more than the extra CPU cycles for compress/decompress harms performance.

At some point, of course, you're always going to max out either CPU or I/O on your server, but running more than one process in parallel will help to some extent, especially if the rows aren't all going into just one table. If they are (I'm not familiar with the data set in question) then you do have a potential problem there, since some of these approaches won't help, but if not, then once you max your machine, you still have the option of marshaling multiple servers to do the work of initially creating the tables, and then copy them over to the final destination when you're done.

With MyISAM tables, the table spaces are intrinsically transportable from server to server. You can create the tables on one or more entirely different servers; run FLUSH TABLES WITH READ LOCK; and then (while keeping the tables locked by not closing your connection) you can actually copy the .frm and .MY* files from one server to the other. When you run SHOW TABLES after copying the files into the directory for the appropriate database, and setting the permissions, they just magically appear and are accessible, because MyISAM does not have an internal data dictionary. Everything is in those files.

In MySQL 5.6, InnoDB tablespaces became transportable also; it's not quite as simple as MyISAM, because the files have to be prepared first: http://dev.mysql.com/doc/refman/5.6/en/tablespace-copying.html ... but still potentially a viable way to get more work done in parallel.

You could also also leverage cloud servers to create and index these tables for you and then copy the tablespace files back to your production system, assuming you have sufficient bandwidth available to actually transfer the files... and you're willing to gamble just a little bit... the spot market price (for idle cloud capacity) for some very serious hardware (multi-core, with high-capacity SSDs, etc.) can be less than 1/10th of the on-demand (normal) price... if you go that route, you have to automate the process to move the completed files off of the cloud server as expeditiously as possible, since the machine can be interrupted at any time if the spot price goes above the amount you're willing to pay.

MySQL – Replacing Table Data with CSV

When populating the data there are a number of issues you may run in to. The data size itself is not huge, but more inconvenient... you could choose any of the mechnisms for populating it including regular selects.

The right way to be doing it is with the TRUNCATE followed by the LOAD. If you are having problems with indexes then you should take that up with the MySQL folk, it is a bug, and it should be working.

If you temporarily need a work around for it then I recommend that you add a column to your table which will be unique for each LOAD cycle. A date may be a good candidate. You have already chosen a column 'update', which is a fine column to use.... but, instead of changing the data back to 0 each time, just increment it for each load, or use a number that matches the date like 20140110 for example. Then, after you LOAD the data you can:

DELETE from TABLE where update <> 20140110

Best Answer

Related Solutions

MySQL – Efficiently Cloning Tens of Millions of Database Records

MySQL – Replacing Table Data with CSV

Related Question