Optimized RakeSearch app for rank 9 (computations finished)

bcavnaugh
Joined: 30 Nov 17
Posts: 12
Credit: 46,605,568
RAC: 15,938
Message 230 - Posted: 10 Dec 2017, 18:02:40 UTC - in response to Message 224.  
Last modified: 10 Dec 2017, 18:11:37 UTC

The PEXT instruction is very slow on Ryzen, as I wrote above. Please try the avx2nopext app version; it does not use PEXT and should be a bit faster than the AVX one for you.

I tried it a few days ago, but every task ended immediately with an error, and then the project started blocking me from downloading new units. It also changed all my remaining tasks in BOINC Manager to error tasks. This was on a Ryzen 1700 and a 1700X. So I went back to AVX after detaching the project in BOINC Manager. So I don't know, but I will run new tests later :)

Good to know that it does not work :) I suspect what may be wrong, but today I do not have access to my PC. I will look into it tomorrow.


AVX2 runs great on my Ryzen 1800X rig.
I did have to exit the client software for the new file and settings to be used. Read your config files before you exit. I do have SMT off.
But now we have no tasks, so we have to wait.
http://rake.boincfast.ru/rakesearch/results.php?hostid=1731&offset=0&show_names=0&state=4&appid=

Crunching@EVGA The Number One Team in the BOINC Community. Folding@EVGA The Number One Team in the Folding@Home Community.
ID: 230 · Rating: 0

[B@P] Daniel
Joined: 8 Sep 17
Posts: 89
Credit: 375,708,268
RAC: 88,132
Message 237 - Posted: 11 Dec 2017, 22:06:59 UTC
Last modified: 11 Dec 2017, 22:51:58 UTC

I have fixed the avx2nopext app for Windows, please try it again. The Linux version was fine.

I also added a NEON app version for ARM CPUs. It is about 22% faster than the non-NEON one. Before installing it, please check whether your device supports NEON instructions: open the /proc/cpuinfo file and check whether "neon" appears in the "Features" line.
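
The check described above can be sketched in Python. This is only an illustration, not part of the optimized app: the function name `supports_neon` is invented here, and the sample string is an assumed example of an ARMv7 "Features" line. The sketch also accepts `asimd`, which is how 64-bit ARM kernels report the same SIMD unit.

```python
from pathlib import Path


def supports_neon(cpuinfo_text: str) -> bool:
    """Return True if any 'Features' line lists neon (ARMv7) or asimd (AArch64)."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            flags = value.split()
            if "neon" in flags or "asimd" in flags:
                return True
    return False


# Illustrative input in the format printed by an ARMv7 kernel (an assumption).
sample = "Features\t: swp half thumb fastmult vfp edsp neon vfpv3 tls vfpd32"
print(supports_neon(sample))

# On a real device, read the file instead:
# print(supports_neon(Path("/proc/cpuinfo").read_text()))
```

Matching whole flags rather than substrings avoids false positives from longer flag names that merely contain the text "neon".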

ARM:
real    20m37.322s
user    20m35.665s
sys     0m0.155s

ARM+NEON:
real    15m58.774s
user    15m57.060s
sys     0m0.080s
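
As a worked check of the "about 22%" figure, the two "real" times above can be converted to seconds and compared. The sketch below is an illustration only; the helper name `to_seconds` is invented here. "Faster" can be read two ways: with these numbers the run time drops by about 22.5%, which corresponds to about 29% more work done per unit of time.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time`-style value such as '20m37.322s' to seconds.

    A decimal comma ('3m32,268s') is accepted as well.
    """
    match = re.fullmatch(r"(\d+)m(\d+(?:[.,]\d+)?)s", stamp.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds.replace(",", "."))


plain = to_seconds("20m37.322s")  # ARM, "real" time
neon = to_seconds("15m58.774s")   # ARM+NEON, "real" time

time_saved = 1 - neon / plain        # fraction of run time saved
throughput_gain = plain / neon - 1   # extra work done per unit of time

print(f"run time reduced by {time_saved:.1%}")
print(f"throughput increased by {throughput_gain:.1%}")
```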


Edit: I have added the test.tgz archive, which contains the files needed to run a benchmark test. To use it, unpack the archive somewhere, copy the rakesearch file to the same directory and run the test.sh script.

It is also possible to test the Windows apps. You need to install Cygwin and then follow the steps above. Please do not rename rakesearch.exe to rakesearch; Cygwin will be able to run it as-is.
Note: for some reason Cygwin now displays 0.000 as the user time, which is incorrect. It used to work properly when I was using Win7, so I suspect that Win10 broke this.

Please post your results. I am especially interested in how the AVX512 app compares with the other app versions.
ID: 237 · Rating: 0

Bryan
Joined: 8 Sep 17
Posts: 3
Credit: 120,631,679
RAC: 0
Message 241 - Posted: 12 Dec 2017, 22:26:45 UTC

@Daniel, thanks for the optimized apps. You've made this project fun!

I'm using the SSE, AVX, and AVX2 apps for both Windows and Linux. They are working very well on Intel CPUs. :)
ID: 241 · Rating: 0

phd21
Joined: 13 Dec 17
Posts: 1
Credit: 0
RAC: 0
Message 243 - Posted: 14 Dec 2017, 22:45:24 UTC

Hi Daniel, and Anyone Else involved in this Optimized App,

If this application's code works as well as some are reporting, then would this code be helpful for the other "Boinc" projects, or only for this project? If so, can this code be integrated into the Boinc software?

Best regards,
Phil
phd21
ID: 243 · Rating: 0

[B@P] Daniel
Joined: 8 Sep 17
Posts: 89
Credit: 375,708,268
RAC: 88,132
Message 244 - Posted: 15 Dec 2017, 6:53:08 UTC - in response to Message 243.  

Hi Daniel, and Anyone Else involved in this Optimized App,

If this application's code works as well as some are reporting, then would this code be helpful for the other "Boinc" projects, or only for this project? If so, can this code be integrated into the Boinc software?

Best regards,
Phil
phd21

This code is specific to this project, so it cannot be integrated directly into other projects or into BOINC itself. However, other projects may review all the changes I made, get familiar with the optimization techniques I used and then apply them to their own apps.

I only wonder about the ODLK project: it also works with Latin squares, so maybe it could integrate some of the code directly.
ID: 244 · Rating: 0

JagDoc
Joined: 9 Dec 17
Posts: 5
Credit: 69,199,494
RAC: 4,997
Message 245 - Posted: 16 Dec 2017, 8:15:38 UTC - in response to Message 244.  

On my Odroid-XU4 i get this error:
../../projects/rake.boincfast.ru_rakesearch/rakesearch: error while loading shared libraries: libboinc_api.so.7: cannot open shared object file: No such file or directory
http://rake.boincfast.ru/rakesearch/results.php?hostid=1797
ID: 245 · Rating: 0

[B@P] Daniel
Joined: 8 Sep 17
Posts: 89
Credit: 375,708,268
RAC: 88,132
Message 248 - Posted: 17 Dec 2017, 22:02:32 UTC - in response to Message 245.  

On my Odroid-XU4 i get this error:
../../projects/rake.boincfast.ru_rakesearch/rakesearch: error while loading shared libraries: libboinc_api.so.7: cannot open shared object file: No such file or directory
http://rake.boincfast.ru/rakesearch/results.php?hostid=1797

I have rebuilt the ARM apps, and this library is now linked statically. Please download the new app; it should work now.
ID: 248 · Rating: 0

JagDoc
Joined: 9 Dec 17
Posts: 5
Credit: 69,199,494
RAC: 4,997
Message 250 - Posted: 18 Dec 2017, 18:36:44 UTC - in response to Message 248.  

On my Odroid-XU4 i get this error:
../../projects/rake.boincfast.ru_rakesearch/rakesearch: error while loading shared libraries: libboinc_api.so.7: cannot open shared object file: No such file or directory
http://rake.boincfast.ru/rakesearch/results.php?hostid=1797

I have rebuilt the ARM apps, and this library is now linked statically. Please download the new app; it should work now.

Thank you so much.
Now the arm_v7l_neon app runs on the Odroid-XU4, Odroid-HC1 and Jetson-TK1 without problems.
ID: 250 · Rating: 0

LookAS
Joined: 6 Jan 18
Posts: 3
Credit: 1,747
RAC: 0
Message 271 - Posted: 6 Jan 2018, 21:19:13 UTC
Last modified: 6 Jan 2018, 22:00:32 UTC

Hi, I am trying to run the AVX512 app, and it is not triggering the AVX512 offset set in the BIOS on my i9-7920X CPU; it runs with the AVX2 offset. Is this app really using AVX512?

Edit: I have been observing it for a little longer, and it occasionally triggers the AVX512 offset, but this is quite rare and only lasts a very short time.
ID: 271 · Rating: 0

[B@P] Daniel
Joined: 8 Sep 17
Posts: 89
Credit: 375,708,268
RAC: 88,132
Message 273 - Posted: 7 Jan 2018, 10:54:43 UTC - in response to Message 271.  

Hi, I am trying to run the AVX512 app, and it is not triggering the AVX512 offset set in the BIOS on my i9-7920X CPU; it runs with the AVX2 offset. Is this app really using AVX512?

Edit: I have been observing it for a little longer, and it occasionally triggers the AVX512 offset, but this is quite rare and only lasts a very short time.

The answer is a bit more complicated. In its most performance-critical place, this app version uses a new AVX512 instruction that works on the old AVX registers. Besides this, there are some places where memory blocks are copied, which uses AVX512 registers. However, these copies are made rarely. This matches what you are observing.

BTW, could you test the performance of the various app versions on your machine? In the post linked below I wrote short instructions on how to do this. I am mainly interested in how the AVX512 version compares with the AVX2 one, as I do not have any hardware to run such a benchmark.
http://rake.boincfast.ru/rakesearch/forum_thread.php?id=39&postid=237
ID: 273 · Rating: 0

LookAS
Joined: 6 Jan 18
Posts: 3
Credit: 1,747
RAC: 0
Message 275 - Posted: 7 Jan 2018, 12:22:07 UTC - in response to Message 273.  

Sure, results for AVX2 and AVX512 on i9-7920X (offset for AVX2 is set to 4GHz, for AVX512 is set to 3.8GHz) under Windows 10:

AVX2
real 3m32,268s
user 0m0,000s
sys 0m0,000s

AVX512
real 3m40,743s
user 0m0,000s
sys 0m0,000s

Yes, the times are correct. If it makes a difference, I could set the same offset for both and rerun the benchmark later today.
ID: 275 · Rating: 0

Stephen Uitti
Joined: 12 Nov 17
Posts: 7
Credit: 6,169,826
RAC: 0
Message 279 - Posted: 8 Jan 2018, 19:53:36 UTC - in response to Message 179.  

I downloaded rakesearch_linux_arm_v7l.tgz from github.
On a pi 3 (not overclocked), with Raspbian Stretch, with boinc loaded
sudo apt-get install boinc
I ran the boinc manager and added Rakesearch (by URL). I ignored the warning "this project may not have units for your CPU" (or whatever it says). I usually run my pi 3's headless. I suppose i could have done it with boinccmd. The boinc manager showed me what was going on a bit quicker.

I then installed the application:
# get a root shell
sudo bash
# extract the binary:
cd /var/lib/boinc-client/projects/rake.boincfast.ru_rakesearch/
tar xvf ~pi/rakesearch_linux_arm_v7l.tgz
# exit the root shell
exit

Stopping and starting the boinc manager didn't work, so i restarted the pi
sudo shutdown -r

I let it download a couple units, which executed in 5 to 6 hours each.

I chose the arm_v7l version as it is the one that i expected to work on the pi 3. I don't expect it to work on a pi 2. I have a pi 2 that runs Jessie, and i'll give it a try soon. I also have a pi zero w, and could give that a shot.

I don't expect the NEON version (rakesearch_linux_arm_v7l_neon.tgz) to work on a pi 3. I might give it a try and see. It might possibly work with a 64 bit OS. That would be nice to know for sure, one way or the other. It might work on a banana pi or a higher end droid. I don't have either of these.

The above process is more or less the same as on the x86, which was smooth for me.

I'm running the rakesearch_linux_64_sse2.tgz version on an AMD Phenom (running Linux Mint 13). It's not young enough to support AVX. I also have an AMD A8 also on Mint 13, which does have AVX. I haven't attempted to run that as yet.

I've only looked at Arm optimization a little bit. It looks complicated, and like a ton of work. In particular, getting the data to move in and out of the processor while the processor does the work looks difficult to get right. Daniel has clearly gotten it right, so it very likely was a ton of work. Thanks, very much.

Stephen.
ID: 279 · Rating: 0

[B@P] Daniel
Joined: 8 Sep 17
Posts: 89
Credit: 375,708,268
RAC: 88,132
Message 280 - Posted: 8 Jan 2018, 21:17:00 UTC - in response to Message 279.  

I downloaded rakesearch_linux_arm_v7l.tgz from github.
On a pi 3 (not overclocked), with Raspbian Stretch, with boinc loaded
sudo apt-get install boinc
I ran the boinc manager and added Rakesearch (by URL). I ignored the warning "this project may not have units for your CPU" (or whatever it says). I usually run my pi 3's headless. I suppose i could have done it with boinccmd. The boinc manager showed me what was going on a bit quicker.

I then installed the application:
# get a root shell
sudo bash
# extract the binary:
cd /var/lib/boinc-client/projects/rake.boincfast.ru_rakesearch/
tar xvf ~pi/rakesearch_linux_arm_v7l.tgz
# exit the root shell
exit

Stopping and starting the boinc manager didn't work, so i restarted the pi
sudo shutdown -r

I let it download a couple units, which executed in 5 to 6 hours each.

I chose the arm_v7l version as it is the one that i expected to work on the pi 3. I don't expect it to work on a pi 2. I have a pi 2 that runs Jessie, and i'll give it a try soon. I also have a pi zero w, and could give that a shot.

I don't expect the NEON version (rakesearch_linux_arm_v7l_neon.tgz) to work on a pi 3. I might give it a try and see. It might possibly work with a 64 bit OS. That would be nice to know for sure, one way or the other. It might work on a banana pi or a higher end droid. I don't have either of these.

The above process is more or less the same as on the x86, which was smooth for me.

Good to hear that!

If you want to check whether your RPi supports NEON, please run the following command. If it prints something, your CPU supports NEON instructions.

grep 'neon\|asimd' /proc/cpuinfo | head -1


I'm running the rakesearch_linux_64_sse2.tgz version on an AMD Phenom (running Linux Mint 13). It's not young enough to support AVX. I also have an AMD A8 also on Mint 13, which does have AVX. I haven't attempted to run that as yet.

I've only looked at Arm optimization a little bit. It looks complicated, and like a ton of work. In particular, getting the data to move in and out of the processor while the processor does the work looks difficult to get right. Daniel has clearly gotten it right, so it very likely was a ton of work. Thanks, very much.

Stephen.

Well, most of this complicated stuff is done by the compiler :). I had to find the proper intrinsics that do what I need, and that was the most complicated part for me. Besides this, things are similar to SSE/AVX programming :)
ID: 280 · Rating: 0

[B@P] Daniel
Joined: 8 Sep 17
Posts: 89
Credit: 375,708,268
RAC: 88,132
Message 281 - Posted: 8 Jan 2018, 21:26:47 UTC - in response to Message 275.  

Sure, results for AVX2 and AVX512 on i9-7920X (offset for AVX2 is set to 4GHz, for AVX512 is set to 3.8GHz) under Windows 10:

AVX2
real 3m32,268s
user 0m0,000s
sys 0m0,000s

AVX512
real 3m40,743s
user 0m0,000s
sys 0m0,000s

Yes, the times are correct. If it makes a difference, I could set the same offset for both and rerun the benchmark later today.

Thanks for the results. This is interesting; I thought that the AVX512 version would be a bit faster. I wonder whether it is really slower or whether this was just random variation in execution time. If you run the test a few times (e.g. 3 times), you will see that the numbers are different each time. CPU load also influences the results. Could you repeat these tests a few times with BOINC suspended, to confirm whether the AVX512 version is really slower rather than faster?
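
One simple way to act on this advice is to compare the ranges of repeated runs rather than single numbers. The sketch below is only an illustration: the function name `ranges_overlap` is invented here, and the timings are made-up placeholders, not measurements from this thread.

```python
from statistics import mean


def ranges_overlap(a_runs, b_runs):
    """True if the fastest-to-slowest ranges of the two run sets overlap."""
    return min(a_runs) <= max(b_runs) and min(b_runs) <= max(a_runs)


# Placeholder wall-clock times in seconds (NOT real measurements).
version_a = [212.3, 212.7, 213.1]
version_b = [205.6, 206.0, 205.9]

print(f"A: mean {mean(version_a):.1f} s, B: mean {mean(version_b):.1f} s")
if ranges_overlap(version_a, version_b):
    print("ranges overlap: the difference may be noise, run more tests")
else:
    print("ranges do not overlap: the difference is probably real")
```

With only three runs per version this is a rough screen, not a statistical test; more runs give a more trustworthy picture.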
ID: 281 · Rating: 0

LookAS
Joined: 6 Jan 18
Posts: 3
Credit: 1,747
RAC: 0
Message 284 - Posted: 9 Jan 2018, 19:25:12 UTC - in response to Message 281.  
Last modified: 9 Jan 2018, 19:25:52 UTC

I ran the tests again several times with the offset set to 0 (4300 MHz) for both AVX2 and AVX512, and the results are:

AVX2
real 3m32,724s
user 0m0,000s
sys 0m0,015s

AVX512
real 3m25,637s
user 0m0,000s
sys 0m0,015s

AVX512 looks (and is) faster, but when interpolated it is basically the same as last time, when the offset was set to -3 (4000 MHz); now you can see it clock for clock.
BOINC and other CPU-intensive processes were suspended.

I am available for another test when needed.
Keep up the good work.
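
A clock-for-clock comparison of this kind can be sketched by multiplying each "real" time by the clock the CPU was set to. This is a rough illustration under two assumptions: the CPU held the configured clock for the whole run, and the first benchmark ran at 4.0 GHz (AVX2) and 3.8 GHz (AVX512) as stated when those results were posted. The function names are invented here.

```python
def cycles_billion(seconds, clock_ghz):
    """Approximate clock cycles used, in billions, assuming a constant clock."""
    return seconds * clock_ghz


def cycle_saving(avx2, avx512):
    """Fraction of cycles AVX512 saves versus AVX2; each argument is (seconds, GHz)."""
    return 1 - cycles_billion(*avx512) / cycles_billion(*avx2)


# (seconds, GHz) pairs from the two benchmark posts in this thread.
first = cycle_saving((212.268, 4.0), (220.743, 3.8))   # different offsets
second = cycle_saving((212.724, 4.3), (205.637, 4.3))  # same offset

print(f"different offsets: AVX512 needs {first:.1%} fewer cycles")
print(f"same offset:       AVX512 needs {second:.1%} fewer cycles")
```

Under these assumptions AVX512 comes out ahead in both runs, by roughly 1% and 3% respectively.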
ID: 284 · Rating: 0

[B@P] Daniel
Joined: 8 Sep 17
Posts: 89
Credit: 375,708,268
RAC: 88,132
Message 285 - Posted: 9 Jan 2018, 21:14:07 UTC - in response to Message 284.  

I ran the tests again several times with the offset set to 0 (4300 MHz) for both AVX2 and AVX512, and the results are:

AVX2
real 3m32,724s
user 0m0,000s
sys 0m0,015s

AVX512
real 3m25,637s
user 0m0,000s
sys 0m0,015s

AVX512 looks (and is) faster, but when interpolated it is basically the same as last time, when the offset was set to -3 (4000 MHz); now you can see it clock for clock.
BOINC and other CPU-intensive processes were suspended.

I am available for another test when needed.
Keep up the good work.

Thanks! These results look reasonable; I was expecting something like this. Real WUs are about 6 times longer, so with AVX512 each computation would complete about 40 seconds faster. A PC running 24/7 would be able to complete 5 more WUs per core per day.
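
The arithmetic behind this estimate can be sketched as follows. This is an illustration only: the function name and the constant factor of 6 between test run and real WU are taken as assumptions from the text above. With exactly these inputs the saving is about 42 seconds per WU, and the daily gain works out to a little over 2 WUs per core; the result is sensitive to the assumed WU length.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def wus_per_day(test_seconds, wu_scale=6.0):
    """WUs one core finishes per day if a WU takes wu_scale times the test run."""
    return SECONDS_PER_DAY / (test_seconds * wu_scale)


avx2_test = 3 * 60 + 32.724    # 3m32.724s
avx512_test = 3 * 60 + 25.637  # 3m25.637s

saved_per_wu = (avx2_test - avx512_test) * 6.0
extra_per_day = wus_per_day(avx512_test) - wus_per_day(avx2_test)

print(f"time saved per WU: {saved_per_wu:.1f} s")
print(f"extra WUs per core per day: {extra_per_day:.1f}")
```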
ID: 285 · Rating: 0

troosh
Joined: 12 Jan 18
Posts: 1
Credit: 1,467
RAC: 0
Message 286 - Posted: 12 Jan 2018, 17:08:37 UTC

SABRE Lite i.MX6 (armv7l+neon@996MHz):
ubuntu@viv2:~$ cat /proc/cpuinfo 
processor	: 0
model name	: ARMv7 Processor rev 10 (v7l)
Features	: swp half thumb fastmult vfp edsp neon vfpv3 tls vfpd32 
CPU implementer	: 0x41
CPU architecture: 7
CPU variant	: 0x2
CPU part	: 0xc09
CPU revision	: 10
...
processor	: 3
...
Hardware: Freescale i.MX6 Quad/DualLite (Device Tree)
Revision: 63012
Serial: 0000000000000000

ubuntu@viv2:~$ cat /sys/devices/system/cpu/cpufreq/interactive/hispeed_freq
996000
ubuntu@viv2:~$ uname -a
Linux viv2 3.14.28-11-boundary-9t6 #11 SMP PREEMPT Mon Jan 18 06:31:13 MST 2016 armv7l armv7l armv7l GNU/Linux


ubuntu@viv2:~/BOINC/rakesearch_linux_arm_v7l_neon$ ./test.sh 
Started RakeSearch test...

real    39m34.509s
user    39m32.500s
sys     0m0.460s
Files result.txt and result-ok.txt are identical
ubuntu@viv2:~/BOINC/rakesearch_linux_arm_v7l_neon$ 
ubuntu@viv2:~/BOINC/rakesearch_linux_arm_v7l$ ./test.sh 
Started RakeSearch test...

real    44m23.962s
user    44m22.300s
sys     0m0.390s
Files result.txt and result-ok.txt are identical
ID: 286 · Rating: 0

StyM
Joined: 26 Jan 18
Posts: 2
Credit: 16,579,090
RAC: 0
Message 305 - Posted: 27 Jan 2018, 8:33:23 UTC

Is the optimized app now part of the official package?
ID: 305 · Rating: 0

[B@P] Daniel
Joined: 8 Sep 17
Posts: 89
Credit: 375,708,268
RAC: 88,132
Message 306 - Posted: 27 Jan 2018, 13:50:43 UTC - in response to Message 305.  

Is the optimized app now part of the official package?

Not yet, but it is planned.

BTW, I am going to release a new optimized app version soon. Stay tuned!
ID: 306 · Rating: 0

Stephen Uitti
Joined: 12 Nov 17
Posts: 7
Credit: 6,169,826
RAC: 0
Message 327 - Posted: 12 Mar 2018, 16:00:59 UTC - in response to Message 280.  

Thanks Daniel. I grep'ed for sse2 on the Phenom, didn't think to grep for neon on the Arms.

It turns out that both the pi 2 and the pi 3 Arm processors support NEON. Both processor systems have completed units. The pi 2 and pi 3 systems have gotten credit for NEON units.
Pi zeros don't work with the accelerated apps. They error out right away. (I've turned them off.) One zero was running Jessie, and the other Stretch, but I'm sure it's the processor, not the OS.

I've verified that the AMD A8 is in fact running the AVX accelerated app, and is successful. It's about 20% slower than the Phenom II, which doesn't have AVX, and is running SSE2. It's not unusual for the A8 to run 20% faster or 20% slower than the Phenom II on different apps or benchmarks. I might try the SSE2 app on the A8. I time these by pasting 20 valid units stats into a spreadsheet, and averaging.

Stephen.
ID: 327 · Rating: 0


©2019 The searchers team, Karelian Research Center of the Russian Academy of Sciences