Well, I do not expect a more concise answer than the one available from here.
What I understand about a 32-bit OS is that addresses are expressed in 32 bits, so at most the OS could use 2^32 = 4GB of memory space.
The most that the process can address is 4GB. You are potentially confusing memory with address space. A process can have more memory than address space. That is perfectly legal and quite common in video processing and other memory-intensive applications. A process can be allocated dozens of GB of memory and swap it into and out of the address space at will. Only 2 GB can go into the user address space at a time (the default user/kernel split on 32-bit Windows).
If you have a four-car garage at your house, you can still own fifty cars. You just can't keep them all in your garage. You have to have auxiliary storage somewhere else to store at least 46 of them; which cars you keep in your garage and which ones you keep in the parking lot down the street is up to you.
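To put numbers on the size of the "garage", here is a minimal sketch of the arithmetic. The function name addr_space_gib is made up for illustration, and treating the 2 GB user portion as a 31-bit range assumes an even user/kernel split of the 4 GB space.

```shell
# Size of an address space with the given number of address bits, in GiB.
# (addr_space_gib is a hypothetical helper, not part of any tool.)
addr_space_gib() {
    echo $(( (1 << $1) >> 30 ))
}

addr_space_gib 32   # full 32-bit virtual address space -> 4
addr_space_gib 31   # user half under an even split     -> 2
```

Note that nothing in this calculation depends on how much RAM or page file the machine has; it only describes how many addresses a process can name at once.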
Does this mean that on any 32-bit OS, be it Windows or Unix, if the machine has more than 4GB of RAM plus page file on disk, for example 8GB RAM and a 20GB page file, there will never be "memory used up"?
Absolutely it does not mean that. A single process could use more memory than that! Again, the amount of memory a process uses is almost completely unrelated to the amount of virtual address space a process uses. Just like the number of cars you keep in your garage is completely unrelated to the number of cars you own.
Moreover, two processes can share non-private memory pages. If twenty processes all load the same DLL, the processes all share the memory pages for that code. They don't share virtual memory address space, they share memory.
My point, in case it is not clear, is that you should stop thinking of memory and address space as the same thing, because they're not the same thing at all.
If this 32-bit OS machine has 2GB RAM and a 2GB page file, increasing the page file size won't help the performance. Is this true?
You have fifty cars and a four-car garage, and a 100 car parking lot down the street. You increase the size of the parking lot to 200 spots. Do any of your cars get faster as a result of you now having 150 extra parking spaces instead of 50 extra parking spaces?
After a lot more searching I think I have convinced myself that there is no simple way to get what I want.
So, what did I end up doing? I installed LiME from GitHub (https://github.com/504ensicsLabs/LiME):
git clone https://github.com/504ensicsLabs/LiME
cd LiME/src
make -C /lib/modules/`uname -r`/build M=$PWD modules
The above commands create the lime.ko kernel module. A full dump of memory can be obtained by then running:
insmod ./lime.ko "path=/root/temp/outputDump.bin format=raw dio=0"
This just inserts the kernel module; the quoted string holds the parameters that specify the output file location and format... AND IT WORKED! YAY.
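One way to sanity-check the resulting file is sketched below. It assumes that LiME's raw format simply concatenates the kernel's "System RAM" ranges, so the dump should be about as large as the sum of those ranges in /proc/iomem. The function name system_ram_bytes is made up for this sketch, and the sample ranges are invented test data, not output from a real machine.

```shell
# Sum the sizes of all top-level "System RAM" ranges in /proc/iomem-style
# text read from stdin and print the total in bytes.
# (system_ram_bytes is a hypothetical helper, not part of LiME.)
system_ram_bytes() {
    total=0
    while IFS= read -r line; do
        case "$line" in
            # Top-level entries start with a hex digit; nested ones are indented.
            [0-9a-f]*" : System RAM")
                range=${line%% *}      # e.g. 00001000-00001fff
                start=${range%-*}
                end=${range#*-}
                total=$(( $total + 0x$end - 0x$start + 1 ))
                ;;
        esac
    done
    echo "$total"
}

# Demo on made-up sample data: two 4 KiB RAM ranges -> 8192
system_ram_bytes <<'EOF'
00001000-00001fff : System RAM
  00001000-000017ff : Kernel code
00002000-00002fff : Reserved
100000000-100000fff : System RAM
EOF
```

On a real system you would run system_ram_bytes < /proc/iomem as root (unprivileged reads show zeroed addresses) and compare the result with the size of /root/temp/outputDump.bin, for example from stat -c %s on GNU systems.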
Best Answer
The kernel sees the physical memory and provides a view of it to the processes. If you ever wondered how a process can have a 4 GB memory space if your whole machine has only 512 MB of RAM, that's why. Each process has its own virtual memory space. The addresses in that address space are mapped either to physical pages or to swap space. Pages mapped to swap space have to be swapped back into physical memory before your process can access them.
The example from Torvalds in XQYZ's answer (DOS highmem) is not too far-fetched, although I disagree with his conclusion that PAE is generally a bad thing. It solved specific problems and has its merits, but all of that is a matter of opinion. For example, the implementer of a library may not perceive the implementation as easy, while the user of that library may perceive it as very useful and easy to use. Torvalds is an implementer, so he's bound to say what the statement says. For an end user this solves a problem, and that's what the end user cares about.
For one, PAE helps solve another legacy problem on 32-bit machines. It allows the kernel to map the full 4 GB of memory and work around the BIOS memory hole that exists on many machines and causes a pure 32-bit kernel without PAE to "see" only 3.1 or 3.2 GB of memory, despite the physical 4 GB.
Anyway, for the 64-bit kernel it's a symmetrical relation between the physical pages and the virtual pages (leaving swap space and other details aside). The PAE kernel, however, maps between a 32-bit pointer within the process's address space and a 36-bit address in physical memory. That is the main difference: more book-keeping is needed compared to a full linear address space. Keyword: "Extended Page-Table". But this is somewhat more of a programming question. For PAE it's chunks of 4 GB, as you mentioned.
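To give a feel for that book-keeping, here is a rough sketch of how a 32-bit virtual address is carved up under PAE with 4 KiB pages, following the layout in the Intel manual: a 2-bit index into the page-directory-pointer table, a 9-bit page-directory index, a 9-bit page-table index and a 12-bit offset. The function name pae_split and the example address are made up for illustration.

```shell
# Split a 32-bit virtual address into the four fields used by PAE paging
# with 4 KiB pages: PDPT index, page-directory index, page-table index,
# and byte offset within the page.
# (pae_split is a hypothetical helper for illustration only.)
pae_split() {
    addr=$(( $1 ))
    echo "$(( (addr >> 30) & 0x3 )) $(( (addr >> 21) & 0x1ff )) $(( (addr >> 12) & 0x1ff )) $(( addr & 0xfff ))"
}

pae_split 0xC0101234   # prints: 3 0 257 564
```

The virtual address is still only 32 bits wide; the extra physical address bits come from the wider (64-bit) table entries that these indices select.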
Aside from that, both PAE and 64-bit allow for large pages (instead of the standard 4 KB pages in 32-bit).
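As a rough illustration of why large pages matter, the sketch below counts how many pages are needed to map 1 GiB. It assumes the 2 MiB large-page size used by PAE and 64-bit paging, which is not stated above; pages_needed is a made-up helper name.

```shell
# Number of pages of a given size needed to cover a region, rounded up.
# (pages_needed is a hypothetical helper for illustration only.)
pages_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

GIB=$(( 1 << 30 ))
pages_needed "$GIB" $(( 4 * 1024 ))          # 4 KiB pages       -> 262144
pages_needed "$GIB" $(( 2 * 1024 * 1024 ))   # 2 MiB large pages -> 512
```

Fewer pages means fewer page-table entries to maintain for the same amount of mapped memory.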
Chapter 3 of Volume 1 of the Intel Processor Manual has some overview and Chapter 3 of Volume 3A ("Protected Mode Memory Management") has more details, if you want to read up on it.
You're right. However, the majority of people are users, not implementers. That's why they won't care. And as long as you don't require huge amounts of memory for your application, many people don't care (especially since there are compatibility layers).