Optimized RakeSearch app

[B@P] Daniel
Joined: 8 Sep 17
Posts: 63
Credit: 140,190,555
RAC: 28,448
Message 172 - Posted: 25 Nov 2017, 20:11:12 UTC
Last modified: 25 Nov 2017, 20:14:42 UTC

Hi all,
As you may have noticed, I have been working on an optimized version of the app and testing it on my machines. After applying a series of code optimizations I got an app that is much faster than the original one. On top of this I added SSE/AVX support, which gave an extra boost. Here are the results for processing a small sample workunit on my Haswell Xeon running CentOS Linux:
Original app:
real    13m29.530s
user    13m27.579s
sys     0m0.027s

SSE2:
real    1m26.704s
user    1m24.704s
sys     0m0.004s

AVX:
real    1m27.987s
user    1m25.985s
sys     0m0.005s

AVX2+BMI2:
real    1m20.868s
user    1m18.872s
sys     0m0.003s

As you can see, in this test the optimized apps are roughly 9 to 10 times faster than the original! For real WUs the speedup varies from WU to WU, but it is still about 4-5 times, and most WUs on this machine complete in less than an hour.

The optimized app can be downloaded from GitHub: https://github.com/sirzooro/RakeSearch/releases/tag/v1.0. There are multiple app versions, compiled for different instruction sets. If you are not sure what your CPU supports, use CPU-Z on Windows, or check the "flags" line in /proc/cpuinfo on Linux.
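
On Linux the flag check can be scripted. This is a minimal sketch, assuming the usual x86 "flags" line in /proc/cpuinfo; the function name list_simd_flags is made up for this example, and avx512f is the flag name the kernel reports for the AVX-512 foundation set:

```shell
# list_simd_flags FILE: print which of the instruction sets relevant to the
# optimized apps appear in the "flags" line of a cpuinfo-style file.
# Hypothetical helper, not part of the RakeSearch release.
list_simd_flags() {
    # Take the first "flags" line, split it into one word per line,
    # keep only exact matches, then join the result with spaces.
    grep -m1 '^flags' "$1" \
        | tr ' \t' '\n\n' \
        | grep -x -E 'sse2|avx|avx2|bmi2|avx512f' \
        | LC_ALL=C sort -u \
        | paste -s -d ' ' -
}

# Usage: list_simd_flags /proc/cpuinfo
```

Pick the archive whose name matches the most capable set that shows up in the output.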

To install the app, perform these steps:
- close BOINC (reloading the config files is not enough);
- unpack the archive into the project directory. On Windows this is a path like "C:\Users\All Users\BOINC\projects\rake.boincfast.ru_rakesearch", on Linux /var/lib/boinc/projects/rake.boincfast.ru_rakesearch/. On Linux, please also make sure that the rakesearch file is executable and that both rakesearch and app_info.xml are owned by the boinc user and group;
- start BOINC again.

After doing this, you should see an entry for RakeSearch in the event log like "Found app_info.xml; using anonymous platform". You should also see (Opti v1.0) in the app name displayed in BOINC Manager.
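
If the client writes its event log to a file, the same check can be done from a shell. A minimal sketch: the function name uses_anonymous_platform is made up, and the log path in the usage comment is an assumption that depends on how BOINC was installed (some installs log to the system journal instead):

```shell
# uses_anonymous_platform LOGFILE: succeed if the log contains the message
# BOINC prints when it picks up app_info.xml.
uses_anonymous_platform() {
    grep -qF 'Found app_info.xml; using anonymous platform' "$1"
}

# Usage (log location is an assumption; adjust it for your install):
#   uses_anonymous_platform /var/lib/boinc/stdoutdae.txt && echo "optimized app active"
```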

All app versions check whether the CPU and OS support the required instruction sets. If they do not, the app prints an appropriate error message and exits with code 1.

The AVX/AVX2 app versions require at least Windows 7 SP1, Windows Server 2008 R2 SP1, or Linux with kernel 2.6.30.
The AVX512 app versions require at least Windows 10, Windows Server 2016, or Linux with kernel 3.15. I am not sure about the Windows versions; you can check whether earlier versions can run them too.

Similar performance of the SSE2 and AVX versions is expected, as the AVX instruction set is mostly dedicated to floating-point operations, which are not used in this app. The AVX app version can probably be skipped altogether.
AVX2 added integer and bitwise operations that use the new AVX registers, so this app version is faster than the SSE2/AVX versions. An additional boost comes from BMI2 instructions, which came in handy in a few places. As far as I can tell, BMI2 is supported by all CPUs that support AVX2.
The AVX512 version should be even faster, thanks to the new mask registers. I do not have a CPU with them, so I cannot check this. I only tested my code on an emulator to make sure that it works correctly.

At the moment there is no AVX512 app for Linux; I first have to build a newer compiler version that supports it. I will add this app version later.
The Windows apps are compiled with MinGW gcc and should work on Windows XP.

Conan
Joined: 8 Sep 17
Posts: 23
Credit: 3,186,867
RAC: 10,775
Message 177 - Posted: 26 Nov 2017, 22:28:39 UTC
Last modified: 26 Nov 2017, 22:29:27 UTC

Thanks for this, Daniel. Great work, and you're an asset to the project.

Does the Win 32 XP app require a GPU? I installed the download and got the message that "App version needs Open CL and my GPU does not support it" (which it doesn't as it's an AMD/ATI 4800 type). I have not selected GPU for anything so why would that matter?

Still waiting for some work to download to my Linux machines to see the new speed up.

Thanks for your efforts

Conan

Conan
Joined: 8 Sep 17
Posts: 23
Credit: 3,186,867
RAC: 10,775
Message 178 - Posted: 27 Nov 2017, 0:05:53 UTC - in response to Message 177.  

UPDATE: You're the man, Daniel. Don't worry about my comments above: the OpenCL message does not stop the app you compiled from working.
My Windows XP 32-bit machine has now processed its first work units, in under 40 minutes each, and they have validated as well.
My work on the Linux machines was taking over 6 hours, so I can't wait to get more work on them.

Thanks again

Conan

[B@P] Daniel
Joined: 8 Sep 17
Posts: 63
Credit: 140,190,555
RAC: 28,448
Message 179 - Posted: 27 Nov 2017, 0:08:45 UTC - in response to Message 177.  

I have added a Linux AVX512 app, plus apps for Linux ARM and Linux AARCH64. The ARM app was compiled on an Odroid XU4 with an ARM v7l CPU; I am not sure whether it will work on earlier CPU versions, so please try it and let me know.

I also found that I had measured the time incorrectly: a checkpoint file had already been created, so I measured only the last part of the calculations. Ooops! :) I have repeated my tests and got the following results. They also include results for the ARM app on the Odroid XU4 and the AARCH64 app on the Odroid CU2:
Original app:
real    54m57.442s
user    54m55.481s
sys     0m0.346s

SSE2:
real    6m2.431s
user    6m0.451s
sys     0m0.030s

AVX:
real    5m45.740s
user    5m43.759s
sys     0m0.026s

AVX2:
real    5m24.624s
user    5m22.626s
sys     0m0.042s

Odroid XU4 - ARMv7 Processor rev 3 (v7l)
real    20m37.322s
user    20m35.665s
sys     0m0.155s

Odroid CU2 - AARCH64
real    26m45.051s
user    26m42.920s
sys     0m0.060s


As you can see, this time the AVX app has a clear advantage over the SSE2 one, so this app version should stay.

The AARCH64 app is slower than the ARM one in this test, but on real WUs it is faster. Total runtime is about 3-4 hours on my devices.
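
The speedup factors can be derived directly from the "real" times listed above. A small sketch; the helper names to_seconds and speedup are made up for illustration, and the values are the ones from this post:

```shell
# to_seconds TIME: convert a shell `time` value such as 54m57.442s to seconds.
to_seconds() {
    echo "$1" | LC_ALL=C awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# speedup BASE NEW: how many times faster NEW is than BASE, one decimal place.
speedup() {
    LC_ALL=C awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.1f\n", a / b }'
}

speedup 54m57.442s 6m2.431s    # original vs SSE2, prints 9.1
speedup 54m57.442s 5m45.740s   # original vs AVX,  prints 9.5
speedup 54m57.442s 5m24.624s   # original vs AVX2, prints 10.2
```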

Regarding Conan's question about the OpenCL message: no, it is a CPU app and does not require a GPU. Strange. Where do you see this message?

Dr Who Fan
Joined: 7 Sep 17
Posts: 15
Credit: 268,613
RAC: 60
Message 181 - Posted: 27 Nov 2017, 4:36:14 UTC - in response to Message 172.  

Thank you Daniel!

The Win_32_sse2 is running on my XP PC.
Never could get the stock apps to run on that PC.

Running the Win_64_sse2 on one PC. Will add it to the rest of my 64-bit PCs over the next day or so.


Conan
Joined: 8 Sep 17
Posts: 23
Credit: 3,186,867
RAC: 10,775
Message 183 - Posted: 27 Nov 2017, 5:32:11 UTC - in response to Message 179.  

The message was in the Event Log when BOINC restarted. It is not a problem, so I wouldn't worry about it.
Thanks again

Conan

mmonnin
Joined: 8 Sep 17
Posts: 13
Credit: 7,594,452
RAC: 1,866
Message 186 - Posted: 2 Dec 2017, 2:07:33 UTC

Not sure it's working on my 1950X with the AVX2 app. 34 min into my first set of tasks and it's only 55% done. My older 3770K with only AVX is 19 min in and at 66%.

[B@P] Daniel
Joined: 8 Sep 17
Posts: 63
Credit: 140,190,555
RAC: 28,448
Message 187 - Posted: 2 Dec 2017, 9:46:53 UTC - in response to Message 186.  

It works, but slower than expected. Please try the AVX version; it may be faster for you. I recently read that the PEXT instruction from the BMI2 set is slow on AMD CPUs, and the AVX2 app uses it in the most performance-critical part. This could explain why the app is so slow on an AMD CPU.
https://www.reddit.com/r/Amd/comments/60i6er/ryzen_and_bmi2_strange_behavior_and_high_latencies/

Maybe an AVX2 app without BMI instructions would be better here. I will take a look at this.

mmonnin
Joined: 8 Sep 17
Posts: 13
Credit: 7,594,452
RAC: 1,866
Message 188 - Posted: 2 Dec 2017, 14:41:12 UTC - in response to Message 187.  
Last modified: 2 Dec 2017, 14:48:22 UTC

Yeah, I guess I meant it wasn't working as well as expected. Seeing big numbers put up by others.

Point/CPU Sec went from 0.0454 average to 0.0613 with the AVX app. A good 35% improvement on the 1950x.

Great job and thanks for another optimized app for BOINC!

[B@P] Daniel
Joined: 8 Sep 17
Posts: 63
Credit: 140,190,555
RAC: 28,448
Message 189 - Posted: 2 Dec 2017, 17:57:43 UTC

I found that an AVX2 app for AMD can still use the other BMI2 instructions; it only needs to avoid PEXT/PDEP. I have created such an app and uploaded it to GitHub; it has "avx2nopext" in the file name. Here are performance results from my Haswell Xeon, with the existing AVX and AVX2 apps added for comparison. As you can see, the new app is a bit faster than AVX. Please check whether it is also faster on your machine.
AVX:
real    5m45.740s
user    5m43.759s
sys     0m0.026s

AVX2+BMI2:
real    5m24.624s
user    5m22.626s
sys     0m0.042s

AVX2+BMI2, without PEXT:
real    5m38.600s
user    5m36.622s
sys     0m0.022s
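
The choice between app versions discussed so far can be written down as a rule of thumb. This is only a sketch of this thread's experience, not an official recommendation: the function name pick_variant is made up, the input is assumed to be a /proc/cpuinfo-style file, and AVX512 is left out because it has not been benchmarked here:

```shell
# pick_variant FILE: suggest an app variant name from a cpuinfo-style file.
# The variant names mirror the ones used in this thread.
pick_variant() {
    flags=$(grep -m1 '^flags' "$1")
    vendor=$(grep -m1 '^vendor_id' "$1")
    # has NAME: succeed if NAME appears as a whole word in the flags line.
    has() { printf '%s\n' "$flags" | grep -q -w "$1"; }
    if has avx2 && has bmi2; then
        # PEXT/PDEP are reported to be slow on AMD (Ryzen), so avoid them there.
        case "$vendor" in
            *AuthenticAMD*) echo avx2nopext ;;
            *)              echo avx2 ;;
        esac
    elif has avx; then
        echo avx
    elif has sse2; then
        echo sse2
    else
        echo unsupported
    fi
}

# Usage: pick_variant /proc/cpuinfo
```

Treat the result as a starting point and compare run times on your own machine before settling on a version.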


I also added NEON instructions to the AARCH64 app, which improved its speed by ~20%. NEON instructions are always available on AARCH64, so I replaced the existing non-NEON app with the NEON one on GitHub.
AARCH64, no NEON:
real    26m45.051s
user    26m42.920s
sys     0m0.060s

AARCH64, NEON:
real    20m54.181s
user    20m52.180s
sys     0m0.070s


mmonnin
Joined: 8 Sep 17
Posts: 13
Credit: 7,594,452
RAC: 1,866
Message 190 - Posted: 2 Dec 2017, 23:16:52 UTC

Trying it out now. Posting as a time reference between apps. :)

Tom_unoduetre
Joined: 11 Oct 17
Posts: 1
Credit: 7,680
RAC: 129
Message 199 - Posted: 4 Dec 2017, 8:49:24 UTC

Thank you very much for this awesome app, Daniel. Running 10 times faster is fantastic.

Your app should be the standard app for this project!

jozef j
Joined: 11 Sep 17
Posts: 10
Credit: 64,122,493
RAC: 385,296
Message 207 - Posted: 7 Dec 2017, 14:54:55 UTC

http://rake.boincfast.ru/rakesearch/top_hosts.php
Daniel's top host there has 56 cores and a RAC of 178,439.69, running Linux.
Compared with everything below it, that is an abnormal jump.
Of course, I hope all the hosts below it also use Daniel's good optimized app (e.g. the 88-core one with the AVX2 app).
But this raises the question: is Linux really that good?
Or is there some optimization that is not available to the public?

[B@P] Daniel
Joined: 8 Sep 17
Posts: 63
Credit: 140,190,555
RAC: 28,448
Message 208 - Posted: 7 Dec 2017, 16:36:08 UTC - in response to Message 207.  

No, the reason is different. RAC changes slowly; it is averaged over a long period of time (something like a few weeks). I started running an early version of my app on this host about 3 weeks before I officially released the current version here. Because of this my host already has a high RAC, while the other ones still have to catch up. This difference in RAC should disappear within the next few weeks.
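
How slowly RAC catches up can be estimated with a short calculation. This sketch assumes that RAC is an exponentially decaying average with a half-life of one week (an assumption about how BOINC computes it, not something stated in this thread); under that assumption a host with constant daily output has reached the fraction 1 - 2^(-days/7) of its final RAC after a given number of days. The function name rac_percent is made up:

```shell
# rac_percent DAYS: percentage of the steady-state RAC reached after DAYS
# of constant output, assuming a 7-day half-life (assumption, see above).
rac_percent() {
    LC_ALL=C awk -v d="$1" \
        'BEGIN { printf "%.1f\n", 100 * (1 - exp(-log(2) * d / 7)) }'
}

rac_percent 7    # one week: prints 50.0
rac_percent 21   # three weeks, the head start mentioned above: prints 87.5
```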

jozef j
Joined: 11 Sep 17
Posts: 10
Credit: 64,122,493
RAC: 385,296
Message 211 - Posted: 7 Dec 2017, 23:14:22 UTC

Thank you for the answer.
I also guessed that it was some remnant RAC credit or a long run on one host. I was trolling a bit with this question.

But importantly: I also find that on Ryzen CPUs the plain AVX app is the best one, and it works well. My Ryzen hosts have only a small overclock, because they lack water cooling and the chipset overheats; this project and app heat up the chipset.
On Intel CPUs all the apps are absolutely fantastic. That is probably why we all have badges now )) I hope the project adds more. I really like seeing animals, even though this is a math project; it is refreshing ))
But I am a little disappointed with the TH 1950X. I hope you find some way to pull as much as possible out of this CPU, because "old father Moroz" was here ))))
It would be interesting to hear from users how fast the Intel 512-bit task is on CPUs such as the 7960X and 7980X.

[B@P] Daniel
Joined: 8 Sep 17
Posts: 63
Credit: 140,190,555
RAC: 28,448
Message 213 - Posted: 8 Dec 2017, 6:33:05 UTC - in response to Message 211.  

Thank you for the answer.
I also guessed that it was some remnant RAC credit or a long run on one host. I was trolling a bit with this question.

No problem, it's your karma anyway ;)

But importantly: I also find that on Ryzen CPUs the plain AVX app is the best one, and it works well. My Ryzen hosts have only a small overclock, because they lack water cooling and the chipset overheats; this project and app heat up the chipset.
On Intel CPUs all the apps are absolutely fantastic. That is probably why we all have badges now )) I hope the project adds more. I really like seeing animals, even though this is a math project; it is refreshing ))
But I am a little disappointed with the TH 1950X. I hope you find some way to pull as much as possible out of this CPU, because "old father Moroz" was here ))))

The PEXT instruction is very slow on Ryzen, as I wrote above. Please try the avx2nopext app version; it does not use PEXT and should be a bit faster than the AVX one for you.

It would be interesting to hear from users how fast the Intel 512-bit task is on CPUs such as the 7960X and 7980X.

Good idea. I will prepare a script to help benchmark the different app versions.

jozef j
Joined: 11 Sep 17
Posts: 10
Credit: 64,122,493
RAC: 385,296
Message 214 - Posted: 8 Dec 2017, 20:06:29 UTC

PEXT instruction is very slow on Ryzen, as I wrote above. Please try avx2nopext app version, it does not use it, and should be a bit faster than AVX one for you.

I tried it a few days ago, but every task ended immediately with an error, and then the project started blocking me from downloading new units. It also changed all my remaining tasks in BOINC Manager to errored tasks. This was on a Ryzen 1700 and a 1700X. So I went back to AVX after detaching the project in BOINC Manager. I don't know what went wrong, but I will run new tests later ))

Skivelitis2
Joined: 16 Nov 17
Posts: 3
Credit: 3,016,484
RAC: 15,293
Message 218 - Posted: 9 Dec 2017, 15:45:24 UTC

Could someone please explain the process for correctly unpacking the optimized app files in Linux? I have successfully downloaded and extracted the files to my desktop, but when attempting to place them in the rakesearch folder I hit a dead end. I must be going about this the wrong way. I am trying to use the same process as setting up a cc_config file and it is not working.

So far in Linux Mint Xfce 18.2:
(1) Download file
(2) Extract contents to desktop (couldn't figure out how to extract directly to the rakesearch folder as in Win 7)
(3) Tried using gksudo xed /var/lib/boinc-client/projects/rake.boincfast.ru_rakesearch/ to open the destination folder and add contents but no go.

Is the command wrong or do I need to add /home/skivelitis before /var /lib? Or as is most likely am I completely off-base? I have been using Linux for about a year now but only on dedicated number crunchers and am definitely a noob.
Thanks in advance.


[B@P] Daniel
Joined: 8 Sep 17
Posts: 63
Credit: 140,190,555
RAC: 28,448
Message 223 - Posted: 9 Dec 2017, 19:24:14 UTC - in response to Message 218.  
Last modified: 9 Dec 2017, 19:27:20 UTC

I do not use a desktop on Linux, only the shell :) Here are the commands to run. You may have to adjust the paths and URLs:
su -
cd /var/lib/boinc/projects/rake.boincfast.ru_rakesearch/
wget https://github.com/sirzooro/RakeSearch/releases/download/v1.0/rakesearch_linux_64_avx.tgz
tar zxvf rakesearch_linux_64_avx.tgz
systemctl restart boinc-client

The above commands are enough to download, unpack and install the AVX app on CentOS 7. You may have to adjust them a bit for your Linux distribution: BOINC may live in the /var/lib/boinc-client/... directory, and its service may be called boinc instead of boinc-client. BTW, BOINC prints the path to its directory in the event log when it starts, so you can look for it there.
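
For systems where the location is not obvious, the lookup and the permission steps from the first post can be sketched as below. The function names and the optional ROOT argument are made up for this example; only the two data directories named in this thread are checked, and the boinc user and group are taken from the install notes in the first post:

```shell
# find_project_dir [ROOT]: print the first RakeSearch project directory found
# under the two data directories mentioned in this thread. ROOT exists only so
# the function can be tried against a scratch tree; it defaults to empty.
find_project_dir() {
    for base in /var/lib/boinc /var/lib/boinc-client; do
        dir="${1:-}$base/projects/rake.boincfast.ru_rakesearch"
        if [ -d "$dir" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1
}

# fix_app_files DIR: make the binary executable and, when run as root on a
# system that has a boinc user, hand both files to boinc:boinc.
fix_app_files() {
    chmod +x "$1/rakesearch"
    if [ "$(id -u)" -eq 0 ] && id boinc >/dev/null 2>&1; then
        chown boinc:boinc "$1/rakesearch" "$1/app_info.xml"
    fi
}

# Usage (as root, after unpacking): dir=$(find_project_dir) && fix_app_files "$dir"
```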

[B@P] Daniel
Joined: 8 Sep 17
Posts: 63
Credit: 140,190,555
RAC: 28,448
Message 224 - Posted: 9 Dec 2017, 19:39:41 UTC - in response to Message 214.  

Good to know that it does not work :) I have a suspicion about what may be wrong, but I do not have access to my PC today; I will look into it tomorrow.

©2018 The searchers team