For reference, almost all my DB experience is with MongoDB, and this is my first Postgres project. I'm storing data updates from decentralized finance (DeFi) crypto pools. The data structure I'm exploring is very simple.
There are tokens:
type Token =
{
Address: string // primary key
Decimals: int
Symbol: string
}
This data will NEVER change but is referenced everywhere; there are roughly 2k entries.
Then there are liquidity tokens:
type LiquidityToken =
{
Address: string // primary key
Symbol: string
Token0Address: string // matches the address from a token above
Token1Address: string // matches the address from a token above
Timestamp: DateTime // changes at every update
Token0Amount: decimal // changes at every update
Token1Amount: decimal // changes at every update
}
This is updated every 5 minutes; however, ONLY Token0Amount and Token1Amount change, and I want to keep a history.
I have roughly 500 of these objects, each with a new update every 5 minutes, so roughly 144k new entries per day.
So the first thing I wanted to do is join the Token0/1Address fields to the Token record. But since the Token only adds an int and a short string (up to 32 chars, but usually < 8 chars), I was wondering if this saves much space (I don't know how much the join itself costs).
then I was wondering if it is worth it to split the object into two pieces:
type LiquidityToken =
{
Address: string // primary key
Symbol: string
Token0Address: string // matches the address from a token above
Token1Address: string // matches the address from a token above
}
and
type LiquidityTokenUpdate =
{
Address: string // matches the address of the liquidity token
Timestamp: DateTime // changes at every update
Token0Amount: decimal // changes at every update
Token1Amount: decimal // changes at every update
}
and just append a new LiquidityTokenUpdate row at every update. But then a query may pull the update, which pulls the liquidity token, which in turn has to pull two Token objects.
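In Postgres terms, this split might look roughly like the sketch below (the table/column names and the varchar sizes are just my assumptions, not settled choices):

```sql
CREATE TABLE token (
    address  varchar(64) PRIMARY KEY,   -- on-chain address
    decimals integer     NOT NULL,
    symbol   varchar(32) NOT NULL
);

-- Static part: written once per pool
CREATE TABLE liquidity_token (
    address        varchar(64) PRIMARY KEY,
    symbol         varchar(32) NOT NULL,
    token0_address varchar(64) NOT NULL REFERENCES token (address),
    token1_address varchar(64) NOT NULL REFERENCES token (address)
);

-- Changing part: one new row per pool every 5 minutes
CREATE TABLE liquidity_token_update (
    address       varchar(64) NOT NULL REFERENCES liquidity_token (address),
    observed_at   timestamptz NOT NULL,   -- the Timestamp field
    token0_amount numeric     NOT NULL,
    token1_amount numeric     NOT NULL,
    PRIMARY KEY (address, observed_at)
);
```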
Since this is for a public GraphQL interface, quite a few usage patterns could pop up. I can't predict the specific queries, because this is something I want to make public for the community once it's ready.
I understand that caching is my friend here, and since the static data is small it all fits in RAM. But overall, would it make sense to try to save space (each record is not very large) at the cost of joins or local-cache lookups, or is it better to just save everything at each update, since disk is cheap after all?
Please keep in mind that this is not something I have experience with; I've read about joins, normalizing vs. denormalizing tables, etc., but I don't have practical experience with them. I've dealt with very large data sets, but in a non-SQL context, so this is my first time here and I hope the question provides the info needed 😀
Best Answer
I'm not the one who downvoted your question, so I can't speak on behalf of that person, but my guess is that it was because your question is a little abstract and wordy, while this site usually favors direct, concretely worded questions, so it's a little unclear exactly what your goals are / what you're asking. That said, I agree it's annoying to be downvoted without explanation, so conversely you have my upvote, because your question is fair enough.
Anyway, what I can say is that the size of your data is of no concern here for PostgreSQL. A quick calculation shows that at the rate of new data you mentioned, you'd still only be at around half a billion records after a 10-year timespan. I wouldn't be too concerned about caching either, because your object is rather small (ergo a single row in the table isn't very wide and will be small).
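To make that concrete, the back-of-the-envelope numbers (assuming the 500 pools and 5-minute interval from the question) can be checked with a throwaway query:

```sql
-- 24h × 60min ÷ 5min = 288 updates per day per pool
SELECT 500 * (24 * 60 / 5)            AS rows_per_day;      -- 144000
SELECT 500 * (24 * 60 / 5) * 365 * 10 AS rows_in_10_years;  -- 525600000, ~0.5 billion
```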
If your Token object and LiquidityToken object have a one-to-one relationship, then it doesn't realistically matter much whether you normalize them into two tables or denormalize them into one: you wouldn't have any data redundancy, and the row / table wouldn't grow much in total size by combining them into one table, so there's no concern there either.
If the relationship from Token to LiquidityToken is one-to-many, that's a different story, because you'll end up repeating properties of the Token across multiple rows of the LiquidityToken table, which can make management more difficult should one of those properties' values change.
The only thing I'd personally recommend, because it just makes sense structurally, is a third table dedicated to storing the history of the LiquidityToken table. It should store just the Token0Amount, Token1Amount, and Address fields (plus a DateTime-based field to log when each update happened), to minimize data redundancy and maximize normalization. This keeps your LiquidityToken table as lightweight as possible, which in turn makes joins to the Token table as efficient as possible. The history table would only be joined to as needed, and with the DateTime field, only as far back as needed.
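As an illustrative sketch of that suggestion (my own table and column names, assuming tables mirroring the question's Token and LiquidityToken records), the history table and a time-bounded lookup could look like:

```sql
-- History table: only the fields that change, keyed by pool + time
CREATE TABLE liquidity_token_history (
    liquidity_token_address varchar(64) NOT NULL,
    observed_at             timestamptz NOT NULL,
    token0_amount           numeric     NOT NULL,
    token1_amount           numeric     NOT NULL,
    PRIMARY KEY (liquidity_token_address, observed_at)
);

-- Join back to the static tables only as far back as needed,
-- e.g. the last 7 days for one pool:
SELECT h.observed_at, h.token0_amount, h.token1_amount,
       t0.symbol AS token0_symbol, t1.symbol AS token1_symbol
FROM liquidity_token_history AS h
JOIN liquidity_token AS lt ON lt.address = h.liquidity_token_address
JOIN token AS t0 ON t0.address = lt.token0_address
JOIN token AS t1 ON t1.address = lt.token1_address
WHERE h.liquidity_token_address = $1
  AND h.observed_at >= now() - interval '7 days'
ORDER BY h.observed_at;
```

The composite primary key on (pool address, timestamp) doubles as the index that makes the time-range filter cheap.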