MySQL – Data Lake: Data Catalog and Data Schema definitions

data dictionary, database-design, MySQL, postgresql

I'm trying to understand the definitions of data catalog and data schema. After reading several articles, I'm even more confused!

I noticed that when an article discusses a data lake, it uses the term "data catalog" for the metadata describing what the data lake contains. As the name suggests, it is a catalog of everything in the data lake itself.

Now, as far as I know, a data catalog is a database, whereas a data schema is described as the metadata within a database. That would mean a data catalog in a data lake is equivalent to a data schema in a database.
Both seem to serve the same purpose, such as data discovery.

EXCEPT that the terms are used at different levels: one for a data lake and the other for a database.

My questions are :

  1. Aren't these terms used inconsistently and confusingly? Which is the right term to use at each level?
  2. If the two are different, how do they differ? Please enlighten me.

Best Answer

This has a regular structure ("schema"):

id  name   color  cost
1   dress  green    40
2   camera red     200
3   car    black 34000

Note the rigid "rows" and "columns". The columns have specific "datatypes" (string, number, etc.).
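As a rough sketch of what that rigidity looks like in practice, the table above could be declared as follows. Python's built-in sqlite3 module stands in for MySQL or Postgres purely so the example runs anywhere; the table name `items` and the chosen column types are assumptions, not something stated in the answer.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL/Postgres for portability.
conn = sqlite3.connect(":memory:")

# The schema is declared once, up front: every column has a name and a datatype.
conn.execute(
    """
    CREATE TABLE items (
        id    INTEGER PRIMARY KEY,
        name  TEXT    NOT NULL,
        color TEXT    NOT NULL,
        cost  INTEGER NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO items (id, name, color, cost) VALUES (?, ?, ?, ?)",
    [(1, "dress", "green", 40), (2, "camera", "red", 200), (3, "car", "black", 34000)],
)

# Every row has exactly the declared columns.
rows = conn.execute("SELECT id, name, color, cost FROM items ORDER BY id").fetchall()

# A column that is not part of the schema is rejected outright.
try:
    conn.execute("INSERT INTO items (id, name, pixels) VALUES (4, 'phone', '12M')")
    rejected = False
except sqlite3.OperationalError:
    rejected = True
```

The failed insert is the point: a row cannot bring along a field the schema does not already know about.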

The following has some structure, but is also very open-ended and "self-defining".

{name: "dress", size: "large"},
{name: "camera", pixels: "12M", zoom: "4x"},
{name: "car", style: "sedan"},
{name: "family", members: ["father", "mother", "children"]}
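To make "self-defining" a little more concrete, here is a minimal sketch that loads the records above as plain Python dictionaries and works out which fields exist by looking at the data itself. The variable names are illustrative only.

```python
# The four records from above, written as Python dictionaries.
records = [
    {"name": "dress", "size": "large"},
    {"name": "camera", "pixels": "12M", "zoom": "4x"},
    {"name": "car", "style": "sedan"},
    {"name": "family", "members": ["father", "mother", "children"]},
]

# Each record carries its own field names; there is no shared declaration.
fields_per_record = [sorted(record) for record in records]

# The only way to learn the overall "shape" is to scan the data.
all_fields = sorted(set().union(*records))
common_fields = sorted(set.intersection(*(set(record) for record in records)))
```

Here only `name` turns out to be shared by every record, and nothing guarantees that even that will hold for the next record added.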

The former example is the core design of every RDBMS and the primary focus of MySQL, Postgres, etc. It is designed and implemented for speed and scaling. The "catalog" and "schema" define the structure separately from the data.
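A small sketch of "structure separate from the data": the table below is empty, yet its structure can already be looked up. SQLite is again used only as a stand-in, and its `sqlite_master` table and `PRAGMA table_info` are SQLite-specific; MySQL and Postgres expose comparable metadata through `information_schema.columns`. The table name `items` is an assumption.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL/Postgres.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, color TEXT, cost INTEGER)"
)

# No rows have been inserted, but the metadata is already queryable.
# Which tables exist?
tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]

# Which columns and datatypes does the table have?
# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(items)")]

row_count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

The metadata answers "what could be stored here" without reading a single data row, which is what makes it useful for data discovery.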

The latter example is less structured and less performant, but more flexible. Perhaps this is the "data lake" you are asking about.

I suggest you focus on RDBMS and don't confuse yourself with "other" ideas.