Since there are few distinct values in the pk2 column, it might be more efficient to have an index with the reverse column order (pk2, pk1):
ALTER TABLE `table`
  ADD INDEX tIX (pk2, pk1) ;
and then use the query:
SELECT e3.pk2
FROM `table` e3
GROUP BY e3.pk2
HAVING EXISTS
      ( SELECT 1
        FROM `table` t
        WHERE t.pk2 = e3.pk2
          AND t.pk1 BETWEEN '2013-09-24 12:27:00.0'
                        AND '2013-11-29 12:27:00.0'
      )
-- ORDER BY NULL
 ;
(On MySQL versions before 8.0, GROUP BY implies sorting the result; uncomment the ORDER BY NULL to skip that sort.)
Depending on the size of the tables and the values of the datetime range, you may get a plan that uses the primary index in the dependent subquery:
ID  SELECT_TYPE         TABLE  TYPE   POSSIBLE_KEYS  KEY      KEY_LEN  REF     ROWS  EXTRA
1   PRIMARY             e3     range  (null)         tIX      4        (null)  61    Using index for group-by
2   DEPENDENT SUBQUERY  t      range  PRIMARY,tIX    PRIMARY  4        (null)  4     Using where; Using index
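To check which plan the optimizer picks on your own data, you can prefix the query with EXPLAIN (a minimal sketch, reusing the query and index names from above):

EXPLAIN
SELECT e3.pk2
FROM `table` e3
GROUP BY e3.pk2
HAVING EXISTS
      ( SELECT 1
        FROM `table` t
        WHERE t.pk2 = e3.pk2
          AND t.pk1 BETWEEN '2013-09-24 12:27:00.0'
                        AND '2013-11-29 12:27:00.0'
      ) ;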
But with a larger datetime range or a larger table (we can't test too many rows at SQLFiddle; see the SQL-Fiddle), you may get another possible execution plan. I expect this one to be more efficient when there are few distinct values in the pk2 column: with 318 distinct values, it will do 318 lookups using the newly added index, which should beat the previous plan that first fetches the (possibly millions of) rows matching the datetime range condition and then finds the DISTINCT pk2 values among them:
ID  SELECT_TYPE         TABLE  TYPE   POSSIBLE_KEYS  KEY  KEY_LEN  REF     ROWS  EXTRA
1   PRIMARY             e3     range  (null)         tIX  4        (null)  61    Using index for group-by
2   DEPENDENT SUBQUERY  t      range  PRIMARY,tIX    tIX  4        func    52    Using where; Using index
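To gauge how many index lookups the second plan implies on your data, counting the distinct pk2 values is a quick check (a sketch against the same table; the alias distinct_pk2 is just illustrative):

SELECT COUNT(DISTINCT pk2) AS distinct_pk2   -- one lookup per distinct value in the second plan
FROM `table` ;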
The search is performed on the server host where the database service is installed, and only the result set is sent back to you. It would make no sense for your client to do this work, as your client is just an interface that allows you to interact with your database. This holds for all databases.
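To make that concrete: in a query like the one below, the scanning and filtering happen entirely on the server, and the client receives only the rows that survive the WHERE clause (a sketch, reusing the table and columns from the example above):

-- Executed on the server; only the matching pk2 values cross the network.
SELECT DISTINCT pk2
FROM `table`
WHERE pk1 >= '2013-09-24 12:27:00.0' ;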
Best Answer
Scan operations aren't usually optimal, but 500,000 records isn't anything too crazy for a standard RDBMS. For example, I've seen Microsoft SQL Server serve up 1 million records from a decently wide table (about 100 columns) in a few seconds on a pretty basic server (4 cores, 32 GB of RAM).
I'm not an expert on NoSQL databases or DynamoDB, but I've researched their use cases in the past, and my understanding is that DynamoDB is most performant when caching smaller amounts of data and recalling smaller "singletonish" datasets, such as paging product information in a catalog or a specific user's settings. So I would say NoSQL databases (or at least DynamoDB) are likely not great for scanning large amounts of data, but a modern RDBMS should be able to handle scanning a few million records on basic hardware these days.
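If you do need to scan millions of rows from an RDBMS, chunking the scan with keyset pagination keeps memory use flat on both ends (a minimal sketch; big_table, its auto-increment primary key id, and the payload column are all hypothetical names):

-- big_table, id, and payload are hypothetical; adapt to your schema.
SELECT id, payload
FROM big_table
WHERE id > 0          -- replace 0 with the last id from the previous chunk
ORDER BY id
LIMIT 10000 ;
-- Repeat, substituting the last id seen, until no rows are returned.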