Optimized RakeSearch app for rank 9 (computations finished)
Author | Message |
---|---|
Send message Joined: 30 Nov 17 Posts: 12 Credit: 47,549,281 RAC: 2,841 |
"PEXT instruction is very slow on Ryzen, as I wrote above. Please try avx2nopext app version, it does not use it, and should be a bit faster than AVX one for you." AVX2 runs great on my Ryzen 1800X rig. I did have to exit the client software for the new file and settings to be used; read your config files before you exit. I have SMT off. But now we have no tasks, so we have to wait. http://rake.boincfast.ru/rakesearch/results.php?hostid=1731&offset=0&show_names=0&state=4&appid= Crunching@EVGA The Number One Team in the BOINC Community. Folding@EVGA The Number One Team in the Folding@Home Community. |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
I have fixed the avx2nopext app for Windows, please try it again. The Linux version was fine. I also added a NEON app version for ARM CPUs. It takes about 22% less time than the non-NEON one. Before installing it, please check whether your device supports NEON instructions: open the /proc/cpuinfo file and check if there is "neon" in the "Features" line. ARM: real 20m37.322s user 20m35.665s sys 0m0.155s. ARM+NEON: real 15m58.774s user 15m57.060s sys 0m0.080s. Edit: I have added the test.tgz archive, which contains the files needed to run a benchmark test. To use it, unpack the archive somewhere, copy the rakesearch file to the same directory, and run the test.sh script. It is also possible to test the Windows apps: install Cygwin, then follow the steps above. Please do not rename rakesearch.exe to rakesearch; Cygwin can run it as-is. Note: for some reason Cygwin now displays 0.000 as the user time, which is incorrect. It used to work properly when I was using Win7, so I suspect that Win10 broke this. Please post your results. I am especially interested in how the AVX512 app compares with the other app versions. |
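For anyone posting benchmark results, here is a minimal Python sketch for turning two "real" times from the test.sh output into a comparison. The function names are made up for illustration and are not part of the RakeSearch tooling. It reports both the share of time saved and the throughput gain, since the two percentages are easy to mix up.

```python
import re


def parse_time(field: str) -> float:
    """Convert a bash `time` field such as '20m37.322s' to seconds.

    A decimal comma ('3m32,268s'), as printed on some locales, is accepted too.
    """
    match = re.fullmatch(r"(\d+)m(\d+(?:[.,]\d+)?)s", field.strip())
    if match is None:
        raise ValueError(f"unrecognized time field: {field!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds.replace(",", "."))


def compare(baseline: str, optimized: str) -> tuple:
    """Return (percent of run time saved, percent throughput gain)."""
    base = parse_time(baseline)
    opt = parse_time(optimized)
    time_saved_pct = (base - opt) / base * 100
    throughput_gain_pct = (base / opt - 1) * 100
    return time_saved_pct, throughput_gain_pct


if __name__ == "__main__":
    # "real" times from the ARM and ARM+NEON runs quoted in the post above.
    saved, gain = compare("20m37.322s", "15m58.774s")
    print(f"time saved: {saved:.1f}%  throughput gain: {gain:.1f}%")
```

Only the "real" field is used here, because Cygwin may report the user time as 0.000.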
Send message Joined: 8 Sep 17 Posts: 3 Credit: 120,921,607 RAC: 0 |
@Daniel, thanks for the optimized apps. You've made this project fun! I'm using the SSE, AVX, and AVX2 apps for both Windows and Linux. They are working very well on Intel CPUs. :) |
Send message Joined: 13 Dec 17 Posts: 1 Credit: 0 RAC: 0 |
Hi Daniel, and Anyone Else involved in this Optimized App, If this application's code works as well as some are reporting, then would this code be helpful for the other "Boinc" projects, or only for this project? If so, can this code be integrated into the Boinc software? Best regards, Phil phd21 |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
"Hi Daniel, and Anyone Else involved in this Optimized App," This code is specific to this project, so it cannot be integrated directly into other projects or BOINC itself. However, other projects may review the changes I made, get familiar with the optimization techniques I used, and then apply them to their own apps. The one case I wonder about is the ODLK project, which also works with Latin squares; maybe it could integrate some of the code directly. |
Send message Joined: 9 Dec 17 Posts: 5 Credit: 70,443,903 RAC: 8,267 |
On my Odroid-XU4 I get this error: ../../projects/rake.boincfast.ru_rakesearch/rakesearch: error while loading shared libraries: libboinc_api.so.7: cannot open shared object file: No such file or directory http://rake.boincfast.ru/rakesearch/results.php?hostid=1797 |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
"On my Odroid-XU4 i get this error:" I have rebuilt the ARM apps, and this library is now linked statically. Please download the new app; it should work now. |
Send message Joined: 9 Dec 17 Posts: 5 Credit: 70,443,903 RAC: 8,267 |
"On my Odroid-XU4 i get this error:" Thank you so much. The arm_v7l_neon app now runs on the Odroid-XU4, Odroid-HC1 and Jetson-TK1 without problems. |
Send message Joined: 6 Jan 18 Posts: 3 Credit: 1,747 RAC: 0 |
Hi, I am trying to run the AVX512 app, and it is not triggering the AVX512 offset set in the BIOS on my i9-7920X CPU; it runs with the offset for AVX2. Is this app really using AVX512? Edit: I have been observing it a little longer, and it occasionally triggers the AVX512 offset, but this is quite rare and only lasts a very short time. |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
"Hi, I am trying running AVX512 app and it is not triggering my AVX512 offset set in BIOS on my i9-7920X cpu, it runs with offset for AVX2. Is this app really using AVX512?" The answer is a bit complicated. In its most performance-critical place, this app version uses a new AVX512 instruction that operates on the old AVX registers. Besides this, there are some places where memory blocks are copied, which uses the AVX512 registers. However, these copies are made rarely. This matches what you are observing. BTW, could you test the performance of the various app versions on your machine? In the post linked below I wrote short instructions on how to do this. I am mainly interested in how the AVX512 version compares with the AVX2 one, since I do not have any hardware to run such a benchmark. http://rake.boincfast.ru/rakesearch/forum_thread.php?id=39&postid=237 |
Send message Joined: 6 Jan 18 Posts: 3 Credit: 1,747 RAC: 0 |
Sure, here are the results for AVX2 and AVX512 on an i9-7920X (the offset for AVX2 is set to 4GHz, for AVX512 to 3.8GHz) under Windows 10. AVX2: real 3m32,268s user 0m0,000s sys 0m0,000s. AVX512: real 3m40,743s user 0m0,000s sys 0m0,000s. Yes, the times are correct. If it makes a difference, I could set the same offset for both and rerun the benchmark later today. |
Send message Joined: 12 Nov 17 Posts: 7 Credit: 6,461,078 RAC: 0 |
I downloaded rakesearch_linux_arm_v7l.tgz from GitHub. On a Pi 3 (not overclocked) with Raspbian Stretch, I installed BOINC with: sudo apt-get install boinc. I ran the BOINC manager and added RakeSearch (by URL). I ignored the warning "this project may not have units for your CPU" (or whatever it says). I usually run my Pi 3s headless, so I suppose I could have done it with boinccmd, but the BOINC manager showed me what was going on a bit quicker. I then installed the application with these commands, in order: sudo bash (to get a root shell); cd /var/lib/boinc-client/projects/rake.boincfast.ru_rakesearch/ followed by tar xvf ~pi/rakesearch_linux_arm_v7l.tgz (to extract the binary); exit (to leave the root shell). Stopping and starting the BOINC manager didn't work, so I restarted the Pi with: sudo shutdown -r. I let it download a couple of units, which executed in 5 to 6 hours each. I chose the arm_v7l version as it is the one that I expected to work on the Pi 3. I don't expect it to work on a Pi 2. I have a Pi 2 that runs Jessie, and I'll give it a try soon. I also have a Pi Zero W, and could give that a shot. I don't expect the NEON version (rakesearch_linux_arm_v7l_neon.tgz) to work on a Pi 3. I might give it a try and see. It might possibly work with a 64-bit OS. That would be nice to know for sure, one way or the other. It might work on a Banana Pi or a higher-end droid. I don't have either of these. The above process is more or less the same as on x86, which was smooth for me. I'm running the rakesearch_linux_64_sse2.tgz version on an AMD Phenom (running Linux Mint 13). It's not young enough to support AVX. I also have an AMD A8, also on Mint 13, which does have AVX. I haven't attempted to run that yet. I've only looked at ARM optimization a little bit. It looks complicated, and like a ton of work. In particular, getting the data to move in and out of the processor while the processor does the work looks difficult to get right. Daniel has clearly gotten it right, so it very likely was a ton of work. Thanks very much. Stephen. |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
"I downloaded rakesearch_linux_arm_v7l.tgz from github." Good to hear that! If you want to check whether your RPi supports NEON, please execute the following command. If it prints something, your CPU supports NEON instructions. grep 'neon\|asimd' /proc/cpuinfo | head -1 "I'm running the rakesearch_linux_64_sse2.tgz version on an AMD Phenom (running Linux Mint 13). It's not young enough to support AVX. I also have an AMD A8 also on Mint 13, which does have AVX. I haven't attempted to run that as yet." Well, most of this complicated stuff is done by the compiler :). I had to find the proper intrinsics to do what I needed, and this was the most complicated part for me. Besides this, things are similar to SSE/AVX programming :) |
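The same NEON check can be scripted, for example to choose between the arm_v7l and arm_v7l_neon archives automatically. This is a minimal Python sketch with a made-up function name, not part of the project. Unlike the grep one-liner, it only looks at whole flags on the "Features" line, which is a slightly stricter test.

```python
def supports_neon(cpuinfo_text: str) -> bool:
    """Return True if a 'Features' line lists the 'neon' or 'asimd' flag.

    32-bit ARM kernels report 'neon'; 64-bit ones report 'asimd' instead.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "features":
            flags = value.split()
            if "neon" in flags or "asimd" in flags:
                return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as cpuinfo:
            print("NEON supported:", supports_neon(cpuinfo.read()))
    except OSError:
        # /proc/cpuinfo only exists on Linux.
        print("could not read /proc/cpuinfo")
```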
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
"Sure, results for AVX2 and AVX512 on i9-7920X (offset for AVX2 is set to 4GHz, for AVX512 is set to 3.8GHz) under Windows 10:" Thanks for the results. This is interesting; I thought the AVX512 version would be a bit faster. I wonder whether it is really slower, or whether this was just random variation in execution time. If you run the test a few times (e.g. 3 times), you will see that the numbers differ each time. CPU load also influences the results. Could you repeat these tests a few times with BOINC suspended, to confirm whether the AVX512 version is really slower rather than faster? |
Send message Joined: 6 Jan 18 Posts: 3 Credit: 1,747 RAC: 0 |
I ran the tests again several times with the offset set the same for AVX2 and AVX512 (0 = 4300MHz), and the results are as follows. AVX2: real 3m32,724s user 0m0,000s sys 0m0,015s. AVX512: real 3m25,637s user 0m0,000s sys 0m0,015s. AVX512 looks (and is) faster, but when interpolated it is basically the same as last time with the offset set to -3 = 4000MHz; now you can see it clock for clock. BOINC and other CPU-intensive processes were suspended. I am available for another test if needed. Keep up the good work. |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
"I run tests again several times with offset set the same for AVX2 and AVX512 to 0 =4300MHz and the results are:" Thanks! These results look reasonable; I was expecting something like this. Real WUs are about 6 times longer, so with AVX512 a computation would complete about 40 seconds faster. At these timings, a PC running 24/7 would be able to complete about 2 more WUs per core per day. |
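A rough way to redo this estimate from the benchmark times is sketched below in Python. It assumes, as stated in the post, that a real WU takes about 6 times as long as the test run, and that a core crunches without pause all day. The function names and the wu_scale parameter are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def wus_per_day(test_seconds: float, wu_scale: float = 6.0) -> float:
    """WUs one core finishes per day if a WU takes wu_scale times the test run."""
    return SECONDS_PER_DAY / (test_seconds * wu_scale)


def compare_throughput(slow_test_s: float, fast_test_s: float, wu_scale: float = 6.0):
    """Return (seconds saved per WU, extra WUs per core per day)."""
    saved_per_wu = (slow_test_s - fast_test_s) * wu_scale
    extra_wus = wus_per_day(fast_test_s, wu_scale) - wus_per_day(slow_test_s, wu_scale)
    return saved_per_wu, extra_wus


if __name__ == "__main__":
    # "real" times from the equal-offset run: AVX2 3m32,724s, AVX512 3m25,637s.
    avx2 = 3 * 60 + 32.724
    avx512 = 3 * 60 + 25.637
    saved, extra = compare_throughput(avx2, avx512)
    print(f"saved per WU: {saved:.1f} s, extra WUs per core per day: {extra:.2f}")
```

The result scales with wu_scale, so a different ratio between test run and real WU length changes the seconds saved per WU proportionally.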
Send message Joined: 12 Jan 18 Posts: 2 Credit: 170,241 RAC: 0 |
SABRE Lite i.MX6 (armv7l+neon@996MHz): ubuntu@viv2:~$ cat /proc/cpuinfo processor : 0 model name : ARMv7 Processor rev 10 (v7l) Features : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpd32 CPU implementer : 0x41 CPU architecture: 7 CPU variant : 0x2 CPU part : 0xc09 CPU revision : 10 ... processor : 3 ... Hardware: Freescale i.MX6 Quad/DualLite (Device Tree) Revision: 63012 Serial: 0000000000000000 ubuntu@viv2:~$ cat /sys/devices/system/cpu/cpufreq/interactive/hispeed_freq 996000 ubuntu@viv2:~$ uname -a Linux viv2 3.14.28-11-boundary-9t6 #11 SMP PREEMPT Mon Jan 18 06:31:13 MST 2016 armv7l armv7l armv7l GNU/Linux ubuntu@viv2:~/BOINC/rakesearch_linux_arm_v7l_neon$ ./test.sh Started RakeSearch test... real 39m34.509s user 39m32.500s sys 0m0.460s Files result.txt and result-ok.txt are identical ubuntu@viv2:~/BOINC/rakesearch_linux_arm_v7l_neon$ ubuntu@viv2:~/BOINC/rakesearch_linux_arm_v7l$ ./test.sh Started RakeSearch test... real 44m23.962s user 44m22.300s sys 0m0.390s Files result.txt and result-ok.txt are identical |
Send message Joined: 26 Jan 18 Posts: 2 Credit: 16,579,090 RAC: 0 |
Is the optimized app now part of the official package? |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
"is the optimized app now part of the official package ?" Not yet, but it is planned. BTW, I am going to release a new optimized app version soon. Stay tuned! |
Send message Joined: 12 Nov 17 Posts: 7 Credit: 6,461,078 RAC: 0 |
Thanks Daniel. I grepped for sse2 on the Phenom, but didn't think to grep for neon on the ARMs. It turns out that both the Pi 2 and the Pi 3 ARM processors support NEON. Both systems have completed units and have gotten credit for NEON units. Pi Zeros don't work with the accelerated apps; they error out right away. (I've turned them off.) One Zero was running Jessie and the other Stretch, but I'm sure it's the processor, not the OS. I've verified that the AMD A8 is in fact running the AVX accelerated app, and is successful. It's about 20% slower than the Phenom II, which doesn't have AVX and is running SSE2. It's not unusual for the A8 to run 20% faster or 20% slower than the Phenom II on different apps or benchmarks. I might try the SSE2 app on the A8. I time these by pasting the stats of 20 valid units into a spreadsheet and averaging. Stephen. |
©2024 The searchers team, Karelian Research Center of the Russian Academy of Sciences