L1 Cache Size of 2012 MacBook Air and Comparison to Current Models

Tags: macbook-pro, memory, performance, processor, virtual-memory

Discussion in the comments below this answer about improving the speed of this numpy-based diffraction calculator suggests that the reason the script in that answer runs slowly for me but fast for others (even on earlier versions of NumPy) might be that my late 2012 MacBook Air's processor has a smaller L1 cache. It could also be that I am running dangerously low on disk space (I was seeing 40 MB/s reads and writes while the script ran).

I'm curious, though: what is the size of the on-processor L1 cache of my late 2012 MacBook Air, and how does it compare to that of new MacBooks?

MacBook Air:   13-inch, Mid 2012
Processor:     1.8 GHz Intel Core i5
Memory:        4 GB 1600 MHz DDR3
Hard Drive:    251 GB Flash Storage

I'm not a developer, but I did a small test. Running the script below, I see that multiplying two NumPy arrays is fastest (a few nanoseconds per float multiply) when the array size is about 10^4 elements. Each element is 8 bytes (float64), so I estimate my L1 cache size at about 10^5 bytes.

Is that close?
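For reference, here is the arithmetic behind that estimate as a quick sketch (it counts only the input array, as above):

import numpy as np

# The benchmark below is fastest at roughly N = 10^4 elements. With float64
# (8 bytes per element), an array that still fits in fast cache is about:
N = 10**4
print(N * np.dtype(np.float64).itemsize)  # 80000 bytes, i.e. roughly 10^5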

Note: I estimate time using both time.time() and time.process_time(). The former (blue, solid line, lower values) is "people time" (wall-clock time): how long I have to wait for something to finish.

[Figure: log-log plot of estimated time (sec) per float multiply vs array size; wall-clock time (blue, solid) and process time (dashed)]

import numpy as np
import matplotlib.pyplot as plt
import time

# Array sizes from 10^1 to 10^8, evenly spaced in log space
Ns = np.logspace(1, 8, 15).astype(int)

t1, t2 = [], []
for N in Ns:
    x = np.random.random(N)
    t1_start = time.time()            # wall-clock time
    t2_start = time.process_time()    # CPU time used by this process
    # Repeat so each size does ~10^6 multiplies in total; the max(1, ...)
    # guards against n == 0 (and division by zero below) when N > 10^6.
    n = max(1, int(1E+06/N))
    for i in range(n):
        y = x*x
    t1.append((time.time() - t1_start)/(N*n))
    t2.append((time.process_time() - t2_start)/(N*n))

plt.figure()
plt.plot(Ns, t1)        # wall-clock time (solid)
plt.plot(Ns, t2, '--')  # process time (dashed)
plt.xscale('log')
plt.yscale('log')
plt.title('estimated time (sec) per float multiply vs array size', fontsize=14)
plt.show()

Best Answer

Your current MacBook Air has an L1 cache of 128 kB. So your guess of 10^5 bytes is not far off!

The latest MacBook Air has the same amount of L1 cache, so in that respect it is the same. However, other parts of the CPU are vastly different.

Note that you might see some of these CPUs listed as having "Smart Cache", which essentially means that some caches are shared among multiple cores. In your case, the L1 cache is never shared: it is always private to each core. This means that if your code is single threaded, the amount of L1 cache available to that program is 64 kB, not 128 kB.

In addition to that, the L1 cache is not actually a single cache. It is split into separate portions for instructions and data. This means that of the 64 kB, 32 kB caches data and 32 kB caches instructions (program code).
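If you want to verify these numbers on your own machine, macOS exposes the cache sizes through sysctl. Here is a minimal sketch using the hw.l1icachesize and hw.l1dcachesize keys, which report the per-core instruction and data L1 sizes in bytes:

import subprocess

# Per-core cache sizes as reported by macOS (values in bytes). On a mid-2012
# MacBook Air, the two L1 keys should each come back as 32768 (32 kB).
for key in ("hw.l1icachesize", "hw.l1dcachesize", "hw.l2cachesize", "hw.l3cachesize"):
    value = subprocess.run(["sysctl", "-n", key], capture_output=True, text=True).stdout.strip()
    print(f"{key} = {value}")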

You might wonder why the new MacBook Air has the same amount of L1 cache and conclude that it hasn't kept up with the rest of the industry, but that is not the case. L1 cache sizes have been remarkably stable for years: 20 years ago you might have seen 32 kB of L1 cache per core, and for many years now 64 kB per core has been pretty standard.

For comparison, the latest Mac Pro (a desktop computer) with the most expensive CPU option (the 24-core Intel Xeon W-3265M) also has 64 kB of L1 cache per core.

Just increasing the size of the L1 cache won't really do you any good in terms of overall performance. L1 cache is the most "expensive" type of cache in terms of die area and power usage per byte. An L1 lookup also has to be extremely fast (it completes even faster than a TLB lookup), so making the cache larger, and thus slower, quickly erodes the benefit. In essence, you get more out of a larger L2 or L3 cache than out of a larger L1 cache, and those are much "cheaper" to grow.

All in all, this indicates that a difference in L1 cache size is not the deciding factor that makes your computer slow where others are fast.

I would guess that a big difference lies in the fact that your old processor does not support the same "number-crunching" instruction sets that a new processor does. This small test example doesn't really show it, but from your links it appears that you're doing real computations on vectors. Specialized SIMD instructions exist that greatly benefit those types of computations, and I know that numpy can take advantage of at least some of them. Your current CPU supports SSE4 and only the original AVX instruction set, whereas a new MacBook Air also supports AVX2.
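If you want to see which of these features your NumPy actually detected and enabled, recent versions can report it. A sketch, assuming NumPy 1.23 or newer for the public helper (versions 1.20-1.22 expose the same information only through an internal dict):

import numpy as np

# Public helper in NumPy 1.23+: prints, among other things, which SIMD
# extensions were found on this CPU and enabled at import time.
np.show_runtime()

# On NumPy 1.20-1.22 the same data lives in an internal dict (an
# implementation detail that may change between versions):
# from numpy.core._multiarray_umath import __cpu_features__
# print({name: on for name, on in __cpu_features__.items() if on})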

For comparison's sake, you might try disabling AVX2/SSE4 support in numpy on the fast machine and rerunning the calculation to see whether that accounts for the majority of the time difference.
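A sketch of that experiment, assuming NumPy 1.20 or newer (which added runtime CPU-feature dispatch along with the NPY_DISABLE_CPU_FEATURES environment variable); the exact features worth disabling depend on what the fast machine's NumPy actually dispatches:

import os

# Must be set before numpy is imported. Only features NumPy dispatches at
# runtime can be disabled; features compiled in as the build's baseline
# cannot (NumPy raises an error if you try).
os.environ["NPY_DISABLE_CPU_FEATURES"] = "AVX2 FMA3"

import numpy as np   # now imports with AVX2/FMA3 code paths disabled

np.show_runtime()    # verify the reduced feature set (NumPy 1.23+)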