I'm trying to understand the definitions of data catalog and data schema. After reading several articles, I'm even more confused!
I noticed that when an article is talking about a data lake, the term data catalog is described as containing metadata information within the data lake. As the name suggests, it is a catalog of all the information within the data lake itself.
Now, as far as I know, a data catalog is a database, whereas a data schema is described as containing metadata information within a database. That would mean a data catalog in a data lake is equivalent to a data schema in a database.
Both seem to have the same purpose, such as data discovery, EXCEPT that the terms are used at different levels: one for a data lake and the other for a database.
My questions are:
- Aren't these terms used inconsistently and confusingly? What is the right term to use at each level?
- If these two are different, how are they different? Please enlighten me.
Best Answer
This has a regular structure ("schema"):
Note the rigid "rows" and "columns". The columns have specific datatypes (string, number, etc.).
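To make the "rigid rows and columns" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `person` and its columns are made up for illustration and are not from any particular system. Every row has the same declared columns, and a row that tries to bring its own extra column is rejected.

```python
import sqlite3

# In-memory database; the "person" table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " age INTEGER NOT NULL)"
)

# Every row must fit the declared columns.
conn.executemany(
    "INSERT INTO person (name, age) VALUES (?, ?)",
    [("Alice", 30), ("Bob", 25)],
)
rows = conn.execute("SELECT id, name, age FROM person ORDER BY id").fetchall()

# A row cannot introduce a column that the schema does not define.
try:
    conn.execute(
        "INSERT INTO person (name, age, nickname) VALUES ('Carol', 41, 'C')"
    )
    rejected = False
except sqlite3.OperationalError:
    rejected = True

print(rows)
print("extra column rejected:", rejected)
```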
The following has some structure, but is also very open-ended and "self-defining".
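As a rough illustration of "self-defining" data, here is a sketch in which each record carries its own field names. The records are hypothetical. Nothing outside the data says which fields exist, so you have to read the data itself to discover its structure.

```python
import json

# Hypothetical records: each one names its own fields, and they need not match.
raw = """
[
  {"name": "Alice", "age": 30},
  {"name": "Bob", "email": "bob@example.com", "tags": ["admin", "ops"]},
  {"title": "Q3 report", "pages": 12}
]
"""
records = json.loads(raw)

# The "schema" can only be discovered by inspecting the records.
fields_per_record = [sorted(record.keys()) for record in records]
all_fields = sorted({key for record in records for key in record})

print(fields_per_record)
print(all_fields)
```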
The former example is the key design of all RDBMSs. It is the primary focus of MySQL, Postgres, etc. It is designed and implemented for speed and scaling. The "catalog" and "schema" define the structure separate from the data.
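Here is a small sketch of what "structure separate from the data" looks like in practice. It assumes SQLite, whose catalog is the `sqlite_master` table and the `PRAGMA table_info` command; MySQL and Postgres expose similar metadata through `information_schema`. The `person` table is a made-up example. Note that the structure can be queried even though the table holds no rows at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; no rows are ever inserted.
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER)"
)

# Which tables exist? Ask SQLite's built-in catalog table.
tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]

# Which columns and declared types does the table have?
# PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(person)")]

row_count = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]

print(tables)
print(columns)
print(row_count)
```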
The latter example is less structured, less performant, but more flexible. Perhaps this is the "data lake" you are asking about.
I suggest you focus on RDBMSs and not confuse yourself with "other" ideas.