Etymology
"Collation" is probably best defined on etymology.com,
late 14c., "act of bringing together and comparing,"
It hasn't changed at all in the past 600 years. "collate" means "to bring together" and whether it's the ordering of a book, chronologically or otherwise in a database, it's all the same.
Computer Science
As it applies to Computer Science, what we're doing is separating the storage mechanism from the ordering. You may have heard of ASCII-betical. That refers to a comparison based on the binary encoding of ASCII characters. In that system, storage and collation are two in the same. If the ASCII-standardized "encoding" ever changed the "collation" (order) would change too.
POSIX started to break that with LC_COLLATE
. But as we move into Unicode a consortium emerged to standardize collations as well: ICU.
SQL
In the SQL spec there are two extensions to provide collations,
F690
: “Collation support:
Includes collation name
, collation clause
, collation definition
and drop collation
.
F692
: Extended collation support,
Includes attaching a different default collation to a column in the schema.
Essentially these provide the ability to CREATE
and DROP
collations, to specify them for operators and sorts, and to define a default for columns.
For more information on what SQL has to offer check out,
Examples
Rather than pasting a limited example, here is the PostgreSQL test suite it's pretty extensive. Check out at least the first link and look for 'Türkiye' COLLATE "tr-x-icu" ILIKE '%KI%' AS "false"
Horizontal Scaling
Horizontal Scaling is essentially building out instead of up. You don't go and buy a bigger beefier server and move all of your load onto it, instead you buy 1+ additional servers and distribute your load across them.
Horizontal scaling is used when you have the ability to run multiple instances on servers simultaneously. Typically it is much harder to go from 1 server to 2 servers then it is to go from 2 to 5, 10, 50, etc.
Once you've addressed the issues of running parallel instances, you can take great advantage of environments like Amazon EC2, Rackspace's Cloud Service, GoGrid, etc as you can bring instances up and down based on demand, reducing the need to pay for server power you aren't using just to cover those peak loads.
Relational Databases are one of the more difficult items to run full read/write in parallel.
I saw Damien Katz speaking about CouchDB at StackOverflow DevDays in Austin and one of his primary focuses for its creation was these parallel instances. As this has been a focus of it since day one, it would be much more capable of taking advantage of horizontal scaling.
Vertical Scaling
Vertical Scaling is the opposite, build up instead of out. Go and buy the strongest piece of hardware you can afford and put your application, database, etc on it.
Real World
Of course, both of these have their advantages and disadvantages. Often times a combination of these two are used for a final solution.
You may have your primary database where everyone writes to and reads real time data on a large piece of hardware. Then have distributed read only copies of the database for heavier data analysis and reporting where being up to the minute doesn't matter as much. Then the front end web application may be running on multiple web servers behind a load balancer.
Best Answer
My video is about the different mappings that have to be made in order to map relations all the way down to hardware. In practice, these different mappings and linearization steps are often confused and mixed up. This is unfortunate as the different, often hard-coded, decisions taken for certain mapping steps then may hinder query performance later on. Sometimes a simple change in one of these mappings may lead to a completely new product line (e.g. "column stores", PAX/Parquet).
In database research the term "materialization" denotes any form of data storage, i.e. any operation that actually sets some bytes on any storage layer eventually. Examples include a deep copy, memory allocation (not to be confused with malloc()), replication, materialized views (rather than dynamic views), intra-pipeline materializations, but also any form of (partial) copies along the storage hierarchy.
In the video, I introduce (and simplify) the different mapping steps. A simplified view of the world in a database is that everything gets eventually stored to physical pages. physical pages is a fixed term in database research. But make sure you understand that it is merely an abstraction. It is a storage unit in a DBMS. We can safely ignore what happens with those physical pages (for the moment) when discussing certain concepts (like query processing). That is what I do at 9:26 in the video as this is not a course on hardware: I say the data from physical pages gets materialized to storage devices. Again: the latter is a much longer story, e.g. factor in ACID, in particular the "D", recovery, CC, ...
But note that physical pages are not the same as physical memory, rather a physical page is mapped to either a main-memory page (which is almost always a virtual page provided by virtual memory) or mapped to some other device, e.g. pages on a hard disk or SSD. Most devices are virtualized inside as well, e.g. some SSDs used RAID 5 inside.
Of course with virtual memory, snapshotting, and different forms of storage indirection the term is sometimes a bit hard to understand. Sometimes you believe that you materialize, but...
For instance, assume you fork a child process in unix. Looks like the process has physical copies of the data, right? No, it hasn't. Only through copy-on-write you will receive physical copies. So, sometimes the boundary between materialize and not materialize gets blurred.
Hope that helps.