Optimized RakeSearch app for rank 9 (computations finished)
Author | Message |
---|---|
Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
Hi all,

As you may have noticed, I have been working on an optimized app version and testing it on my machines. After applying a series of code optimizations I got an app which is way faster than the original one. On top of this I added support for SSE/AVX, which added some extra boost. Here are the results for processing a small sample workunit on my Haswell Xeon running CentOS Linux:

Original app: real 13m29.530s user 13m27.579s sys 0m0.027s
SSE2: real 1m26.704s user 1m24.704s sys 0m0.004s
AVX: real 1m27.987s user 1m25.985s sys 0m0.005s
AVX2+BMI2: real 1m20.868s user 1m18.872s sys 0m0.003s

As you can see, in this test the AVX app is roughly 10 times faster! For real WUs the speedup varies from WU to WU, but it is still about 4-5 times, and most WUs on this machine complete in less than an hour.

The optimized app can be downloaded from GitHub: https://github.com/sirzooro/RakeSearch/releases/tag/v1.0 (new version 1.1 is available!). There are multiple app versions, compiled with support for different instruction sets. If you are not sure what your CPU supports, use CPU-Z on Windows, and on Linux check "flags" in the /proc/cpuinfo file.

In order to install this app, perform these steps:
- close BOINC (config reload will not work);
- unpack the archive to the project directory. On Windows it is a path like "C:\Users\All Users\BOINC\projects\rake.boincfast.ru_rakesearch", on Linux /var/lib/boinc/projects/rake.boincfast.ru_rakesearch/. On Linux please also make sure that the rakesearch file is executable, and that both rakesearch and app_info.xml are owned by the boinc/boinc user/group;
- start BOINC again.

After doing this, you should see an entry for RakeSearch in the event log like "Found app_info.xml; using anonymous platform". Additionally you should see (Opti v1.0) in the app name displayed in BOINC Manager. All app versions check whether the CPU and OS support the required instruction sets. If they do not, the app prints an appropriate error message and exits with code 1.

AVX/AVX2 app versions require at least Windows 7 SP1, Windows Server 2008 R2 SP1 or Linux with kernel 2.6.30. AVX512 app versions require at least Windows 10, Windows Server 2016 or Linux with kernel 3.15. I am not sure about the Windows versions, so you can try whether earlier versions can run them too.

Similar performance of the SSE2 and AVX versions is expected, as the AVX instruction set is mostly dedicated to floating-point operations, which are not used in this app. The AVX app version can probably be skipped altogether. AVX2 added integer and bitwise operations which use the new AVX registers, so this app version is faster than the SSE2/AVX versions. An additional boost comes from BMI2 instructions, which came in handy in a few places. As far as I can tell, BMI2 is supported by all CPUs which support AVX2.

The AVX512 version should be even faster, thanks to the new mask registers. I do not have a CPU with them, so I cannot check this. I only tested my code on an emulator to make sure that it works correctly. At this moment there is no AVX512 app for Linux: I have to compile a new compiler version which supports it. I will add this app version later.

Windows apps are compiled with MinGW gcc, and should work on Windows XP. |
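For the Linux check mentioned in the post above, here is a minimal sketch (not part of the original post) that prints which of the instruction-set flags named in this thread appear on the "flags" line of /proc/cpuinfo. The function name `cpu_features` and its optional file argument are made up for illustration, and `avx512f` is only assumed to be a useful marker for AVX-512 support; the AVX512 app may need more than that.

```shell
#!/bin/sh
# List the instruction-set flags relevant to the optimized apps.
# Usage: cpu_features [cpuinfo-file]   (defaults to /proc/cpuinfo)
cpu_features() {
    file="${1:-/proc/cpuinfo}"
    # Take the first "flags" line, split it into one word per line,
    # keep only the flags of interest, then join them with spaces.
    grep -m1 '^flags' "$file" \
        | tr ' \t' '\n\n' \
        | grep -x -E 'sse2|avx|avx2|bmi2|avx512f' \
        | LC_ALL=C sort \
        | tr '\n' ' ' \
        | sed 's/ $//'
    echo
}

# Example: cpu_features
```

On ARM boards the line is called "Features" rather than "flags", so this sketch only applies to x86 machines.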
Joined: 8 Sep 17 Posts: 44 Credit: 11,250,499 RAC: 3,468 |
Thanks for this Daniel, great work and you're an asset to the project. Does the Win 32 XP app require a GPU? I installed the download and got the message that "App version needs OpenCL and my GPU does not support it" (which it doesn't, as it's an AMD/ATI 4800 type). I have not selected GPU for anything, so why would that matter? Still waiting for some work to download to my Linux machines to see the new speed-up. Thanks for your efforts, Conan |
Joined: 8 Sep 17 Posts: 44 Credit: 11,250,499 RAC: 3,468 |
> Thanks for this Daniel, great work and you're an asset to the project.

UPDATE: You're the man, Daniel. Don't worry about my comments above: the OpenCL thing does not stop the app you compiled from working. My Windows XP 32-bit machine has now processed its first work units and they have validated as well, in under 40 minutes. My work on the Linux machines was taking over 6 hours, so I can't wait to get more work on them. Thanks again, Conan |
Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
I have added a Linux AVX512 app, and apps for Linux ARM and Linux AARCH64. The ARM app was compiled on an Odroid XU4 with an ARM v7l CPU. I am not sure if it will work on earlier CPU versions, so please try it and let me know.

I also found that I measured time incorrectly: it turned out that a checkpoint file had already been created, so I measured time only for the last part of the calculations. Ooops! :) I have repeated my tests and got the following results. They also include results for the ARM app on Odroid XU4 and the AARCH64 app on Odroid CU2:

Original app: real 54m57.442s user 54m55.481s sys 0m0.346s
SSE2: real 6m2.431s user 6m0.451s sys 0m0.030s
AVX: real 5m45.740s user 5m43.759s sys 0m0.026s
AVX2: real 5m24.624s user 5m22.626s sys 0m0.042s
Odroid XU4 - ARMv7 Processor rev 3 (v7l): real 20m37.322s user 20m35.665s sys 0m0.155s
Odroid CU2 - AARCH64: real 26m45.051s user 26m42.920s sys 0m0.060s

As you can see, this time the AVX app has a clear advantage over the SSE2 one, so this app version should stay. The AARCH64 app is slower than the ARM one in this test, but on real WUs it is faster. Total runtime is about 3-4 hours on my devices.

> Thanks for this Daniel, great work and you're an asset to the project.

No, it is a CPU app. Strange. Where do you see this message? |
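To put the corrected timings above into ratios, here is a small sketch (my addition, not from the post) that converts `time`-style values to seconds and divides them. The helper names `to_seconds` and `speedup` are hypothetical, and only the "XmY.YYYs" format shown in this thread is handled.

```shell
#!/bin/sh
# Convert a `time`-style value such as "54m57.442s" into seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Print how many times faster the second run is than the first.
speedup() {
    awk -v base="$(to_seconds "$1")" -v opt="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", base / opt }'
}

# "real" times of the Haswell Xeon runs quoted above, against the original app.
echo "SSE2: $(speedup 54m57.442s 6m2.431s)x"
echo "AVX:  $(speedup 54m57.442s 5m45.740s)x"
echo "AVX2: $(speedup 54m57.442s 5m24.624s)x"
```

With the corrected numbers the ratios come out at roughly 9.1x, 9.5x and 10.2x, so the "about 10 times" figure for this sample workunit still holds.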
Joined: 7 Sep 17 Posts: 35 Credit: 1,709,555 RAC: 432 |
Thank you Daniel! The Win_32_sse2 is running on my XP PC. Never could get the stock apps to run on that PC. Running the Win_64_sse2 on one PC. Will add it to the rest of my 64-bit PCs over the next day or so. |
Joined: 8 Sep 17 Posts: 44 Credit: 11,250,499 RAC: 3,468 |
> I have added a Linux AVX512 app, and apps for Linux ARM and Linux AARCH64. The ARM app was compiled on an Odroid XU4 with an ARM v7l CPU. I am not sure if it will work on earlier CPU versions, so please try it and let me know.

It was in the Event Log at the restart of BOINC. It is not a problem, so I wouldn't worry about it. Thanks again, Conan |
Joined: 8 Sep 17 Posts: 22 Credit: 19,171,868 RAC: 12,035 |
Not sure it's working on my 1950x with the AVX2 app. 34 min into my 1st set of tasks and it's only 55% done. My older 3770k with only AVX is 19 min in and at 66%. |
Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
> Not sure it's working on my 1950x with the AVX2 app. 34 min into my 1st set of tasks and it's only 55% done. My older 3770k with only AVX is 19 min in and at 66%.

It works, but slower than expected. Please try the AVX version, it may be faster for you. I recently read that the PEXT instruction from the BMI2 set is slow on AMD CPUs, and the AVX2 app uses it in the most performance-critical part. This can explain why the app is so slow on an AMD CPU.

https://www.reddit.com/r/Amd/comments/60i6er/ryzen_and_bmi2_strange_behavior_and_high_latencies/

Maybe an AVX2 app without BMI instructions would be better here. I will take a look at this. |
Joined: 8 Sep 17 Posts: 22 Credit: 19,171,868 RAC: 12,035 |
> Not sure it's working on my 1950x with the AVX2 app. 34 min into my 1st set of tasks and it's only 55% done. My older 3770k with only AVX is 19 min in and at 66%.

Yeah, I guess I meant it wasn't working as well as expected. Seeing big numbers put up by others. Points/CPU sec went from a 0.0454 average to 0.0613 with the AVX app. A good 35% improvement on the 1950x. Great job and thanks for another optimized app for BOINC! |
Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
I found that the AVX2 app for AMD can still use the other BMI2 instructions; it only has to avoid PEXT/PDEP. I have created such an app and uploaded it to GitHub; it has "avx2nopext" in the file name. Here are performance results from my Haswell Xeon. I added results for the existing AVX and AVX2 apps for comparison. As you can see, the new app is a bit faster than AVX. Please check if it is also faster on your machine.

AVX: real 5m45.740s user 5m43.759s sys 0m0.026s
AVX2+BMI2: real 5m24.624s user 5m22.626s sys 0m0.042s
AVX2+BMI2, without PEXT: real 5m38.600s user 5m36.622s sys 0m0.022s

I also added NEON instructions to the AARCH64 app, which improved app speed by ~20%. NEON instructions are always available on AARCH64, so I replaced the existing non-NEON app with the NEON one on GitHub.

AARCH64, no NEON: real 26m45.051s user 26m42.920s sys 0m0.060s
AARCH64, NEON: real 20m54.181s user 20m52.180s sys 0m0.070s |
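The choice between the app variants discussed so far can be sketched as a small decision helper. This is only an illustration under the assumptions stated in the thread (PEXT is slow on AMD Ryzen, BMI2 normally accompanies AVX2); `has_flag` and `pick_variant` are hypothetical names, the AVX512 variant is left out because the thread does not say which AVX-512 subsets it needs, and a later post reports the avx2nopext build failing on Ryzen, so treat the result as a starting point for benchmarking rather than a recommendation.

```shell
#!/bin/sh
# Return success if the space-separated list $1 contains the word $2.
has_flag() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Suggest an app variant from a CPU vendor string and a list of CPU flags.
# Usage: pick_variant VENDOR "FLAGS"
pick_variant() {
    vendor="$1"
    flags="$2"
    if has_flag "$flags" avx2 && has_flag "$flags" bmi2; then
        # PEXT is reported as slow on Ryzen, so avoid it on AMD CPUs.
        if [ "$vendor" = "AuthenticAMD" ]; then
            echo avx2nopext
        else
            echo avx2
        fi
    elif has_flag "$flags" avx; then
        echo avx
    elif has_flag "$flags" sse2; then
        echo sse2
    else
        echo none
        return 1
    fi
}

# Example on Linux (x86 only):
# vendor=$(awk -F': ' '/^vendor_id/ { print $2; exit }' /proc/cpuinfo)
# flags=$(awk -F': ' '/^flags/ { print $2; exit }' /proc/cpuinfo)
# pick_variant "$vendor" "$flags"
```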
Joined: 8 Sep 17 Posts: 22 Credit: 19,171,868 RAC: 12,035 |
Trying it out now. Posting as a time reference between apps. :) |
Joined: 11 Oct 17 Posts: 2 Credit: 63,415 RAC: 0 |
Thank you very much for this awesome app Daniel, run times 10 times faster are fantastic. Your app should be the standard app for this project! |
Joined: 11 Sep 17 Posts: 51 Credit: 194,406,895 RAC: 2,340 |
http://rake.boincfast.ru/rakesearch/top_hosts.php

Daniel's host is at the top there, with 56 cores and a RAC of 178,439.69 per day on Linux. Compared with everything below it, that is an abnormal jump. Of course I hope all the hosts below it use Daniel's good optimized app too (like the 88-core host using the AVX2 app). But this raises the question: is Linux really that good? Or is there some optimization that is not accessible to the public? |
Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
> http://rake.boincfast.ru/rakesearch/top_hosts.php

No, the reason is different. RAC changes slowly, as it is averaged over a long period of time (something like a few weeks). I started running an early version of my app on this host about 3 weeks before I created and officially released the current version here. Because of this my host already has a high RAC, while the other ones still have to catch up. This difference in RAC should disappear within the next few weeks. |
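To illustrate why RAC lags behind a sudden speed-up, here is a rough sketch (my addition). It assumes RAC behaves like an exponentially weighted average with a one-week half-life, which is how BOINC describes recent average credit; the constant-daily-credit model and the function name `rac_fraction` are simplifications made up for this example.

```shell
#!/bin/sh
# Fraction of its steady-state RAC that a host has reached after DAYS of
# constant output, assuming an exponential average with a half-life of
# HALF_LIFE days (default 7, an assumption based on BOINC's documentation).
# Usage: rac_fraction DAYS [HALF_LIFE]
rac_fraction() {
    awk -v days="$1" -v half="${2:-7}" \
        'BEGIN { printf "%.3f\n", 1 - exp(-log(2) * days / half) }'
}

for d in 7 14 21 42; do
    echo "after $d days: $(rac_fraction "$d")"
done
```

Under this model a host that has run the faster app for three weeks is already at about 87.5% of its eventual RAC, while hosts that just switched start close to their old value, which matches the head start described in the post above.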
Joined: 11 Sep 17 Posts: 51 Credit: 194,406,895 RAC: 2,340 |
Thank you for the answer. I guessed too that it was some leftover RAC credit or a long run on one host. I was trolling a bit with that question.

But the important part: I also find that on Ryzen CPUs the plain AVX app is the best one, and it works well. My hosts with Ryzen CPUs have only a small overclock, because they lack water cooling and the chipset overheats, since this project and app heat up the chipset. On Intel CPUs all the apps are absolutely fantastic. That's probably why we have all the badges now )) I hope the project adds more. I really like seeing the animals, even though this is a math project. It is refreshing ))

But I am a little disappointed with the Threadripper 1950X. I hope you find some way to get as much as possible out of this CPU, because "old father Moroz" was here ))))

It would be interesting to hear from users how fast the Intel 512-bit app is on CPUs like the 7960X or 7980X. |
Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
> Thank you for the answer.

No problem, it's your karma anyway ;)

> But the important part: I also find that on Ryzen CPUs the plain AVX app is the best one, and it works well. My hosts with Ryzen CPUs have only a small overclock, because they lack water cooling and the chipset overheats, since this project and app heat up the chipset.

The PEXT instruction is very slow on Ryzen, as I wrote above. Please try the avx2nopext app version. It does not use PEXT, and should be a bit faster than the AVX one for you.

> It would be interesting to hear from users how fast the Intel 512-bit app is on CPUs like the 7960X or 7980X.

Good idea. I will prepare a script which will help to benchmark the different app versions. |
Joined: 11 Sep 17 Posts: 51 Credit: 194,406,895 RAC: 2,340 |
> The PEXT instruction is very slow on Ryzen, as I wrote above. Please try the avx2nopext app version. It does not use PEXT, and should be a bit faster than the AVX one for you.

I tried it a few days ago, but all tasks ended immediately with an error, and then the project started blocking me from downloading new units. It also changed all my remaining tasks in BOINC Manager to error tasks. This was on Ryzen 1700 and 1700X. So I went back to AVX after detaching the project in BOINC Manager. So I don't know, but I will do new tests later )) |
Joined: 16 Nov 17 Posts: 4 Credit: 13,065,220 RAC: 2,129 |
Could someone please explain the process for correctly unpacking the optimized app files in Linux? I have successfully downloaded and extracted the files to my desktop, but when attempting to place them in the rakesearch folder I hit a dead end. I must be going about this the wrong way. I am trying to use the same process as setting up a cc_config file and it is not working.

So far in Linux Mint Xfce 18.2:
(1) Download file
(2) Extract contents to desktop (couldn't figure out how to extract directly to the rakesearch folder as in Win 7)
(3) Tried using gksudo xed /var/lib/boinc-client/projects/rake.boincfast.ru_rakesearch/ to open the destination folder and add contents, but no go.

Is the command wrong, or do I need to add /home/skivelitis before /var/lib? Or, as is most likely, am I completely off-base? I have been using Linux for about a year now, but only on dedicated number crunchers, and am definitely a noob. Thanks in advance. |
Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
> Could someone please explain the process for correctly unpacking the optimized app files in Linux? I have successfully downloaded and extracted the files to my desktop, but when attempting to place them in the rakesearch folder I hit a dead end. I must be going about this the wrong way. I am trying to use the same process as setting up a cc_config file and it is not working.

I do not use a desktop on Linux, only the shell :) Here are the commands to execute. You may have to adjust paths and URLs:

    su -
    cd /var/lib/boinc/projects/rake.boincfast.ru_rakesearch/
    wget https://github.com/sirzooro/RakeSearch/releases/download/v1.0/rakesearch_linux_64_avx.tgz
    tar zxvf rakesearch_linux_64_avx.tgz
    systemctl restart boinc-client

The above commands are enough to download, unpack and install the AVX app on CentOS 7. You may have to adjust them a bit for your Linux version. You may have BOINC in the /var/lib/boinc-client/... dir, and its service may be called boinc instead of boinc-client. BTW, BOINC prints the path to its dir in the event log when it starts, so you can look for it there. |
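The first post in this thread also asks that the rakesearch file be executable and that rakesearch and app_info.xml be owned by the boinc user and group. A hedged sketch of that extra step follows; `fix_app_files` is a made-up helper name, and the boinc:boinc default owner follows the first post but may differ on your distribution.

```shell
#!/bin/sh
# Make the app executable and, when running as root, hand both files to OWNER.
# Usage: fix_app_files PROJECT_DIR [OWNER]
fix_app_files() {
    dir="$1"
    owner="${2:-boinc:boinc}"   # default taken from the first post; adjust if needed
    chmod +x "$dir/rakesearch" || return 1
    # Changing ownership needs root and an existing user; skip it otherwise.
    if [ "$(id -u)" -eq 0 ] && id "${owner%%:*}" >/dev/null 2>&1; then
        chown "$owner" "$dir/rakesearch" "$dir/app_info.xml" || return 1
    fi
}

# Example (run as root, adjust the path to your installation):
# fix_app_files /var/lib/boinc/projects/rake.boincfast.ru_rakesearch
```

It would be run after unpacking the archive and before restarting the BOINC service.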
Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
> The PEXT instruction is very slow on Ryzen, as I wrote above. Please try the avx2nopext app version. It does not use PEXT, and should be a bit faster than the AVX one for you.

Good to know that it does not work :) I suspect what may be wrong, but today I do not have access to my PC. I will check it tomorrow. |
©2024 The searchers team, Karelian Research Center of the Russian Academy of Sciences