From ranjith_kumar_b4u at yahoo.com Tue Dec 5 20:28:35 2006
From: ranjith_kumar_b4u at yahoo.com (ranjith kumar)
Date: Tue Dec 5 21:27:28 2006
Subject: prefetching on pentium4
In-Reply-To: <3bbf2fe10611160753q3303d81bw515bffe9af4ee0c9@mail.gmail.com>
Message-ID: <20061206042834.59293.qmail@web58611.mail.re3.yahoo.com>

Hi,
There are 4 types of prefetch instructions on the
Pentium 4 (IA-32) processor:
prefetchnta, prefetcht0, prefetcht1, prefetcht2.

In the case of the Pentium 4, the IA-32 optimization manuals say
that prefetcht0, prefetcht1 and prefetcht2 are identical.

They also say that ONLY the prefetchnta instruction prefetches
data into the L2 cache without polluting caches.

When all four instructions prefetch data into the
L2 cache (not into the L1 cache), what does it mean to
say that prefetchnta does not pollute caches?

I.e., what is the difference between prefetchnta and
the other instructions?

Thanks in advance.

From olivier.certner at free.fr Wed Dec 6 00:56:21 2006
From: olivier.certner at free.fr (Olivier Certner)
Date: Wed Dec 6 00:56:25 2006
Subject: prefetching on pentium4
In-Reply-To: <20061206042834.59293.qmail@web58611.mail.re3.yahoo.com>
References: <20061206042834.59293.qmail@web58611.mail.re3.yahoo.com>
Message-ID: <200612060954.48736.>

Hi,

On a Pentium 4, prefetcht0, prefetcht1 and prefetcht2 are identical, at
least if you don't have a level 3 cache. Intel's documentation is not
very clear about what happens with one more cache in the hierarchy.

The prefetchnta instruction does the same thing (it fetches some memory
bytes into the 2nd level cache), but it is supposed to fetch these bytes
into only one way of the cache. I don't know how the way is chosen.
Unless you are trying to fetch a relatively large volume of data, or data
with a special pattern (i.e., data that would be put at the same index in
the cache, thus utilizing more than one way), you won't see much
difference from the prefetchtX variants. You'll have to determine the
characteristics of the L2 cache on your particular P4 target processor in
order to check that.

Olivier

From attilio at freebsd.org Wed Dec 6 11:20:39 2006
From: attilio at freebsd.org (Attilio Rao)
Date: Wed Dec 6 11:20:44 2006
Subject: prefetching on pentium4
In-Reply-To: <20061206042834.59293.qmail@web58611.mail.re3.yahoo.com>
References: <3bbf2fe10611160753q3303d81bw515bffe9af4ee0c9@mail.gmail.com>
 <20061206042834.59293.qmail@web58611.mail.re3.yahoo.com>
Message-ID: <3bbf2fe10612061050y6fa458abw3b1ace0cd1bebd37@mail.gmail.com>

2006/12/6, ranjith kumar:
> Hi,
> There are 4 types of prefetch instructions on the
> Pentium 4 (IA-32) processor:
> prefetchnta, prefetcht0, prefetcht1, prefetcht2.
>
> In the case of the Pentium 4, the IA-32 optimization manuals say
> that prefetcht0, prefetcht1 and prefetcht2 are identical.
>
> They also say that ONLY the prefetchnta instruction prefetches
> data into the L2 cache without polluting caches.
>
> When all four instructions prefetch data into the
> L2 cache (not into the L1 cache), what does it mean to
> say that prefetchnta does not pollute caches?
>
> I.e., what is the difference between prefetchnta and
> the other instructions?

First of all, it is important to say that a prefetch* instruction is
only a hint for the CPU and not a *command*, so the CPU needs to
evaluate (in an unspecified way) whether or not to accept the caching
request. From this point of view, the prefetch* instructions are meant
to be as accommodating as possible for the CPU.
Different numbers mean different 'critical' levels for the CPU (0 -
highly critical, 2 - less critical), which means prefetching the cache
line to a higher level in the cache hierarchy.
This would mean, hypothetically:

prefetcht0 -> L1 prefetching
prefetcht1 -> L2 prefetching
prefetcht2 -> L3 prefetching

And this is what really happens, for example, on the P3 (considering
that the P3 has no L3 cache, prefetcht2 == prefetcht1).
On the P4 things are different because you cannot manipulate the L1
cache directly, so what happens is:

prefetcht0 -> L2 prefetching
prefetcht1 -> L2 prefetching
prefetcht2 -> L3 prefetching
(if no L3 cache is present, prefetcht2 is the same as the others; hence
the statement that all three instructions behave the same).

prefetchnta is completely different because it fetches a cache line
into the NT cache structure.
Non-temporal caches are global caches which are particularly powerful
because they don't need snooping messages between CPUs (and, in this
way, they reduce the CPUs<->caches traffic) and are used by the NTI
family.

Attilio

-- 
Peace can only be achieved by understanding - A. Einstein

From ranjith_kumar_b4u at yahoo.com Mon Dec 11 08:31:16 2006
From: ranjith_kumar_b4u at yahoo.com (ranjith kumar)
Date: Mon Dec 11 11:27:11 2006
Subject: prefetching on pentium 4
Message-ID: <234144.12400.qm@web58616.mail.re3.yahoo.com>

--- Attilio Rao wrote:

> 2006/12/6, ranjith kumar:
> > Hi,
> > There are 4 types of prefetch instructions on the
> > Pentium 4 (IA-32) processor:
> > prefetchnta, prefetcht0, prefetcht1, prefetcht2.
> >
> > In the case of the Pentium 4, the IA-32 optimization manuals say
> > that prefetcht0, prefetcht1 and prefetcht2 are identical.
> >
> > They also say that ONLY the prefetchnta instruction prefetches
> > data into the L2 cache without polluting caches.
> >
> > When all four instructions prefetch data into the
> > L2 cache (not into the L1 cache), what does it mean to
> > say that prefetchnta does not pollute caches?
> >
> > I.e., what is the difference between prefetchnta and
> > the other instructions?
>
> First of all, it is important to say that a prefetch* instruction is
> only a hint for the CPU and not a *command*, so the CPU needs to
> evaluate (in an unspecified way) whether or not to accept the caching
> request. From this point of view, the prefetch* instructions are meant
> to be as accommodating as possible for the CPU.
> Different numbers mean different 'critical' levels for the CPU (0 -
> highly critical, 2 - less critical), which means prefetching the cache
> line to a higher level in the cache hierarchy.
> This would mean, hypothetically:
>
> prefetcht0 -> L1 prefetching
> prefetcht1 -> L2 prefetching
> prefetcht2 -> L3 prefetching
>
> And this is what really happens, for example, on the P3 (considering
> that the P3 has no L3 cache, prefetcht2 == prefetcht1).
> On the P4 things are different because you cannot manipulate the L1
> cache directly, so what happens is:
>
> prefetcht0 -> L2 prefetching
> prefetcht1 -> L2 prefetching
> prefetcht2 -> L3 prefetching
> (if no L3 cache is present, prefetcht2 is the same as the others; hence
> the statement that all three instructions behave the same).
>
> prefetchnta is completely different because it fetches a cache line
> into the NT cache structure.
> Non-temporal caches are global caches which are particularly powerful
> because they don't need snooping messages between CPUs (and, in this
> way, they reduce the CPUs<->caches traffic) and are used by the NTI
> family.

Thanks. But when I executed two programs, one prefetching with
prefetchnta and the second with prefetcht0, the second program ran
faster. (I used a Pentium 4 processor and the gcc compiler.) What
could be the reason? When is prefetchnta preferable to prefetcht0?

Also, the "IA-32 systems programmers manual" says nothing about a
non-temporal cache structure. The caches in IA-32 processors are the
L1 cache, L2 cache, write-combining buffer, store buffer, instruction
TLB, data TLB and L3 cache (not present in the Pentium 4).
Are the non-temporal cache and the write-combining buffer the same?

Thanks in advance.

>
> Attilio
>
>
> --
> Peace can only be achieved by understanding - A. Einstein
>

From attilio at freebsd.org Mon Dec 11 15:25:16 2006
From: attilio at freebsd.org (Attilio Rao)
Date: Mon Dec 11 15:25:20 2006
Subject: prefetching on pentium 4
In-Reply-To: <234144.12400.qm@web58616.mail.re3.yahoo.com>
References: <234144.12400.qm@web58616.mail.re3.yahoo.com>
Message-ID: <3bbf2fe10612111524i60cb7807wfbb9228b6c8d4b39@mail.gmail.com>

2006/12/11, ranjith kumar:
> --- Attilio Rao wrote:
>
> > 2006/12/6, ranjith kumar:
> > > Hi,
> > > There are 4 types of prefetch instructions on the
> > > Pentium 4 (IA-32) processor:
> > > prefetchnta, prefetcht0, prefetcht1, prefetcht2.
> > >
> > > In the case of the Pentium 4, the IA-32 optimization manuals say
> > > that prefetcht0, prefetcht1 and prefetcht2 are identical.
> > >
> > > They also say that ONLY the prefetchnta instruction prefetches
> > > data into the L2 cache without polluting caches.
> > >
> > > When all four instructions prefetch data into the
> > > L2 cache (not into the L1 cache), what does it mean to
> > > say that prefetchnta does not pollute caches?
> > >
> > > I.e., what is the difference between prefetchnta and
> > > the other instructions?
> >
> > First of all, it is important to say that a prefetch* instruction is
> > only a hint for the CPU and not a *command*, so the CPU needs to
> > evaluate (in an unspecified way) whether or not to accept the caching
> > request. From this point of view, the prefetch* instructions are meant
> > to be as accommodating as possible for the CPU.
> > Different numbers mean different 'critical' levels for the CPU (0 -
> > highly critical, 2 - less critical), which means prefetching the cache
> > line to a higher level in the cache hierarchy.
> > This would mean, hypothetically:
> >
> > prefetcht0 -> L1 prefetching
> > prefetcht1 -> L2 prefetching
> > prefetcht2 -> L3 prefetching
> >
> > And this is what really happens, for example, on the P3 (considering
> > that the P3 has no L3 cache, prefetcht2 == prefetcht1).
> > On the P4 things are different because you cannot manipulate the L1
> > cache directly, so what happens is:
> >
> > prefetcht0 -> L2 prefetching
> > prefetcht1 -> L2 prefetching
> > prefetcht2 -> L3 prefetching
> > (if no L3 cache is present, prefetcht2 is the same as the others; hence
> > the statement that all three instructions behave the same).
> >
> > prefetchnta is completely different because it fetches a cache line
> > into the NT cache structure.
> > Non-temporal caches are global caches which are particularly powerful
> > because they don't need snooping messages between CPUs (and, in this
> > way, they reduce the CPUs<->caches traffic) and are used by the NTI
> > family.
>
> Thanks. But when I executed two programs, one prefetching with
> prefetchnta and the second with prefetcht0, the second program ran
> faster. (I used a Pentium 4 processor and the gcc compiler.) What
> could be the reason? When is prefetchnta preferable to prefetcht0?

As I said, prefetchnta is particularly important in SMP systems.
Are you using a dual-core CPU?
In that case the CPUs, in order to keep their caches synchronized, need
to perform snooping procedures (which are explained in detail in the
"IA32 Software Developers Manual, vol 3"; sorry, I can't remember the
chapter number, but it is the one about cache tricks), and these occupy
the CPU-cache buses.
Using prefetchnta, bytes are fetched into the NT cache system, so the
snooping traffic doesn't affect load/store performance.

> Also, the "IA-32 systems programmers manual" says nothing about a
> non-temporal cache structure. The caches in IA-32 processors are the
> L1 cache, L2 cache, write-combining buffer, store buffer, instruction
> TLB, data TLB and L3 cache (not present in the Pentium 4).
> Are the non-temporal cache and the write-combining buffer the same?

No, they are not.

Attilio

-- 
Peace can only be achieved by understanding - A. Einstein

From ranjith_kumar_b4u at yahoo.com Wed Dec 13 04:25:05 2006
From: ranjith_kumar_b4u at yahoo.com (ranjith kumar)
Date: Wed Dec 13 04:53:01 2006
Subject: writing to performance event select registers
Message-ID: <75870.6526.qm@web58615.mail.re3.yahoo.com>

Hi,
I want to measure the number of last-level cache misses on a
Pentium 4 processor.
The IA-32 programmer's manuals state that there are performance
monitoring counters (architectural, i.e. the same across all IA-32
processors) starting at address 0C1H and performance event select
registers starting at address 186H.

1) When I run a kernel module that writes a value to the performance
event select register (at address 186H) with the wrmsr instruction,
the system hangs. Why?

The program is:

#include <linux/module.h> /* Needed by all modules */
#include <linux/kernel.h> /* Needed for KERN_INFO */
//#include

int i,j,k=0;
unsigned int xx,yy,xx1,yy1,xx2,yy2;
unsigned int t1,t2,t3,t4,BIG=0xffffffff;

int init_module(void)
{
    /* Write EDX:EAX = 0x0:0x0009412E to the MSR at address 0x186. */
    asm volatile ("  movl $0x186, %%ecx;"
                  "  movl $0x0, %%edx;"
                  "  movl $0x0009412E, %%eax;"
                  "  wrmsr;"
                  : : : "%eax", "%edx", "%ecx");
    printk(KERN_INFO " Initially %u=t1 %u=t2 %u=t3 %u=t4 \n", t1, t2, t3, t4);
    return 0;
}

void cleanup_module(void)
{
    printk(KERN_INFO "Goodbye world \n");
}

-------------------------------------------------------
Thanks in advance.