Message boards :
Science :
Source code of the project application
Message board moderation
Previous · 1 · 2
Author | Message |
---|---|
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
I have added 3 more commits today. With them in AVX app version MovePairSearch::MoveRows() takes ~61% of CPU time. AVX2 version is even faster, this function uses ~46% of CPU time. In that version Generator::Start() moved to 1st spot, as it takes ~48% of CPU time :) I saw that it also could benefit from bit operations, so I will optimize it too. I also added some AVX512 code, and I noticed that gcc generates few less instructions when this instruction set is enabled. App on hosts with newest CPUs should be even faster. However I am not sure if anyone here would be able to run it - new CPU is not enough, user also has to use OS which supports AVX512 - at least Linux with kernel 3.15, Windows 10 or Windows Server 2016 (I am not sure about Windows, maybe MS has decided to add support to some earlier versions too). There are few hosts with Xeon Gold on "Top hosts" list, but they use Windows Server 2012, so most probably they will not be able to use these new instructions. |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
Yesterday I thought that I am done with MovePairSearch::MoveRows(). I was wrong :). Today I applied two more optimizations, and AVX2 version of it is at ~43% of CPU now. One of them also opens way to vectorization with SSE2, I will apply it later. |
Send message Joined: 11 Aug 17 Posts: 645 Credit: 22,418,140 RAC: 12,723 |
Daniel, I made a small workunit on ~ 38 millions of squares and about 30 minutes of work for unoptimized application. During the processing, a pair of the ODLS must be finded as in sample result.txt. Thank you for work! |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
Daniel, I made a small workunit on ~ 38 millions of squares and about 30 minutes of work for unoptimized application. During the processing, a pair of the ODLS must be finded as in sample result.txt. Thanks! I will try it. |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
I have problem with this small workunit - MovePairSearch::Read throws exception in first do/while loop. I also found that Generator::Start at the end has line "if (newSquare.Matrix[keyRowId][keyColumnId] == Square::Empty && cellId < 0)". Looks that it is enough to check if "cellId < 0", otherwise app could read data before buffer. Could you confirm if I can remove 1st part of this condition? |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
I have found why this sample file did not work for me - it had windows line endings. After converting it to unix format it started working. I did small benchmark on my machine using this sample workunit and got following results. I tried to benchmark original app, but I do not have it. I tried to attach new machine, but server stopped sending new work to me - looks that I have too many tasks which are pending validation (>2k). SSE2: real 1m26.704s user 1m24.704s sys 0m0.004s AVX: real 1m27.987s user 1m25.985s sys 0m0.005s AVX2+BMI2: real 1m20.868s user 1m18.872s sys 0m0.003s Current app version is probably about 4 times faster than original, and most WUs are completed in less than hour. How is your code review? Probably it is time to start public beta, as I started having problems with too many pending tasks. BTW, Some extra optimizations for Generator::Start may be possible, but this may affect data stored in checkpoint file (and potentially in input file too). I am checking this this. So far everything is backward-compatible and looks that it works fine, I do not have any invalid results from my app. |
Send message Joined: 11 Aug 17 Posts: 645 Credit: 22,418,140 RAC: 12,723 |
Daniel, hello! I reviewed the code. It was very interesting for me and I would like to understand it (and "mechanics" of commands like AVX and SSE) more deeper. And your application works fast, correct and found many ODLS. Could you to share this application and post a new thread in Number crunching (for example) with information about it? After this will to publish the news with a link to you thread. Great work! Thank you! |
Send message Joined: 11 Aug 17 Posts: 645 Credit: 22,418,140 RAC: 12,723 |
I tried to attach new machine, but server stopped sending new work to me - looks that I have too many tasks which are pending validation (>2k). Daniel, could you try to attach a new machine now? I generated a new bunch of tasks, may be the problem was in number of workunits without results on you machines. (For example, if any task in queue has a pair on your hosts). |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
I tried to attach new machine, but server stopped sending new work to me - looks that I have too many tasks which are pending validation (>2k). I tried to downloaded new tasks on already attached ones, and got some. I have run test WU with official app, and got following results. Looks that AVX2 app process this sample WU 10 times faster :) Original: real 13m29.530s user 13m27.579s sys 0m0.027s Daniel, hello! I have added new post here: http://rake.boincfast.ru/rakesearch/forum_thread.php?id=39. Please make sure you check all commits on my branch, there are few commits there which are not linked from this thread. Here is summary of my changes: - use bits and bitwise operations to track elements of square; - do not calculate things which are not necessary. Usually this means calculating things once before loop, calculating them as late as possible (this was usual case here), or in place where it make most sense (check exit condition after decrementing variable, instead of every loop iteration); - simplify algorithm - on "step forward" only set new values, and revert to previous state on "step backward"; - use SSE/AVX to calculate things in parallel; - use metaprogramming (templates), there was one place where "if" could be moved before whole function. Regarding SSE/AVX, unfortunately there are no good tutorials on the net, I only saw few simple ones how to sum or multiply elements two arrays. Most of my knowledge comes from studying various examples on stackoverflow and other sites which appeared in Google. And of course Intel Intrinsics Guide, it contains detailed description of every intrinsic: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#. You can also look on Wikipedia pages about SSE and AVX, they contain summary of things added in every new instruction set. |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
Hi, I have few question about things which are not clear for me: 1. Is Generator::keyValue field needed? I checked few WUs and it always was set to -1 there. I wonder if this field and code which uses it can be removed; 2. Can I assume that every new WU at the beginning will have Generator::cellId set to 0? 3. Can I assume that every new WU at the beginning for every cell in Generator::path will have all corresponding values (bits) in Generator::cellsHistory set to 1 (i.e. all numbers from 0 to 8 can be potentially used for cell)? 4. How do you check if result is valid - are you checking if files are binary identical, or examine contents more closely? I wonder what will happen if for some WU app will find two square pairs, but report them in different order than now. Will server be able to handle this? Or do I need to sort pairs in such case to make result "canonical"? |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
One more question: do you always generate WUs with all values on diagonals set? I checked few WUs and noticed this. I wanted to optimize Generator::Start by processing diagonal elements before non-diagonal ones and save some cycles consumed by processing of 'primary' and 'secondary' variables in latter part, but now I wonder if this part of code could be eliminated completely. |
Send message Joined: 11 Aug 17 Posts: 645 Credit: 22,418,140 RAC: 12,723 |
Hello! Hi, This class member is required for store a value of cell[keyRowId][keyColumnId] at which the generator of diagonal squares must stop. It is a workunit border. With current partitioning of squares on workunits, keyValue is always -1. 2. Can I assume that every new WU at the beginning will have Generator::cellId set to 0? Yes. Because cellId is an index of current cell in path - array Generator::path. But while working cellId changed, of course. You can see this situation in any checkpoint file. 3. Can I assume that every new WU at the beginning for every cell in Generator::path will have all corresponding values (bits) in Generator::cellsHistory set to 1 (i.e. all numbers from 0 to 8 can be potentially used for cell)? In cellsHistory[rowId][columnId] the generator marks values, that used in current visit on this cell - Matrix[rowId][columnId]. When generator make a rollback to previous cell - it clear the history of cell. When generator go "into cell", cell history is clean, but real set of values, available for inserting into cell determine by flags in columns[value][columnId] and rows[rowId][value] arrays and cellsHistory, of course. 4. How do you check if result is valid - are you checking if files are binary identical, or examine contents more closely? I wonder what will happen if for some WU app will find two square pairs, but report them in different order than now. Will server be able to handle this? Or do I need to sort pairs in such case to make result "canonical"? By binary equivalency. But applications use an identical Generator::path[] (that supplied in workunit file) and computations must produce the identical squares sets in identical order. One more question: do you always generate WUs with all values on diagonals set? I checked few WUs and noticed this. I wanted to optimize Generator::Start by processing diagonal elements before non-diagonal ones and save some cycles consumed by processing of 'primary' and 'secondary' variables in latter part, but now I wonder if this part of code could be eliminated completely. In workunits of current search all diagonals - primary and secondary completely filled. Excluding of if (rowid == columnId) and if (rowId == Rank - 1 - columnId), if I understand you right - may be a good idea. Thank you for attention! |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
Thanks for answers! Hello! Do you have any plans to start using non-negative values of keyValue in the future? I wonder if you could use longer path prefix instead, and remove this field completely. 2. Can I assume that every new WU at the beginning will have Generator::cellId set to 0? I asked these two questions because I wonder if I can assume a "clean slate" state at the beginning, i.e. only bits corresponding to cells in constant path prefix are changed, and all other would be changes by app during WU processing. I would like to use cellsHistory to store cell value candidates in similar way like in MovePairSearch::MoveRows. If in new WU cells on path would be partially processed, this would complicate things for me. 4. How do you check if result is valid - are you checking if files are binary identical, or examine contents more closely? I wonder what will happen if for some WU app will find two square pairs, but report them in different order than now. Will server be able to handle this? Or do I need to sort pairs in such case to make result "canonical"? OK, so GPU app would have to take care of this (I started working on it). It would run few hundreds of generators (or maybe thousands?), each of them processing its part of search space. They would run in parallel, so order of results is no longer predetermined. One more question: do you always generate WUs with all values on diagonals set? I checked few WUs and noticed this. I wanted to optimize Generator::Start by processing diagonal elements before non-diagonal ones and save some cycles consumed by processing of 'primary' and 'secondary' variables in latter part, but now I wonder if this part of code could be eliminated completely. Yes, this is what I meant. Generator is 2nd most time-consuming function (~40% of total time), so elimination of these checks would speedup everything. |
Send message Joined: 11 Aug 17 Posts: 645 Credit: 22,418,140 RAC: 12,723 |
Daniel, hello! Do you have any plans to start using non-negative values of keyValue in the future? I wonder if you could use longer path prefix instead, and remove this field completely. In this search,I think keyValue will always be -1, because current partition global tasks on workunits is practical. [quote]2. Can I assume that every new WU at the beginning will have Generator::cellId set to 0? Yes. Because cellId is an index of current cell in path - array Generator::path. But while working cellId changed, of course. You can see this situation in any checkpoint file. I asked these two questions because I wonder if I can assume a "clean slate" state at the beginning, i.e. only bits corresponding to cells in constant path prefix are changed, and all other would be changes by app during WU processing. I would like to use cellsHistory to store cell value candidates in similar way like in MovePairSearch::MoveRows. If in new WU cells on path would be partially processed, this would complicate things for me. Yes. All cell in the Generator::path must be "clean" at start, because them form a work for Generator. And only these cells modify by Generator during work. And none cells from first line, diagonals and Matrix[1][0] (that not included in path) - cannot be changed by it. OK, so GPU app would have to take care of this (I started working on it). It would run few hundreds of generators (or maybe thousands?), each of them processing its part of search space. They would run in parallel, so order of results is no longer predetermined. Very interesting! Maybe set a number for each GPU thread by and place results into order of them? (But will have to abandon from checkpoints). Yes, this is what I meant. Generator is 2nd most time-consuming function (~40% of total time), so elimination of these checks would speedup everything. Ok. Thank you very much Daniel! |
Send message Joined: 11 Aug 17 Posts: 645 Credit: 22,418,140 RAC: 12,723 |
Hello! Now I test a new application under Linux and want to compile it under Windows, using MinGW, but get an error: $ make C:\MinGW\bin\g++.exe -O3 -ftree-vectorize -ID:\GitHub\boinc -ID:\GitHub\boinc/lib -ID:\GitHub\boinc/api -LD:\GitHub\boinc/api -LD:\GitHub\boinc/lib -c main.cpp In file included from D:\GitHub\boinc/api/boinc_api.h:24:0, from main.cpp:5: D:\GitHub\boinc/lib/boinc_win.h:164:21: fatal error: winhttp.h: No such file or directory #include <winhttp.h> ^ compilation terminated. make: *** [main.o] Error 1 I use a makefile: BOINC_DIR = D:\GitHub\boinc BOINC_API_DIR = $(BOINC_DIR)/api BOINC_LIB_DIR = $(BOINC_DIR)/lib COMPILER = C:\MinGW\bin\g++.exe CXXFLAGS += -O3 -ftree-vectorize \ -I$(BOINC_DIR) \ -I$(BOINC_LIB_DIR) \ -I$(BOINC_API_DIR) \ -L$(BOINC_API_DIR) \ -L$(BOINC_LIB_DIR) PROGRAM = rakesearch all: $(PROGRAM) clean: rm -f $(PROGRAM) *.o $(PROGRAM): main.o Square.o Generator.o MovePairSearch.o $(BOINC_LIB_DIR)/libboinc.a $(BOINC_API_DIR)/libboinc_api.a $(COMPILER) $(CXXFLAGS) -o $(PROGRAM) main.o Square.o Generator.o MovePairSearch.o -pthread -lboinc_api -lboinc Square.o: Square.cpp $(COMPILER) $(CXXFLAGS) -c Square.cpp Generator.o: Generator.cpp $(COMPILER) $(CXXFLAGS) -c Generator.cpp MovePairSearch.o: MovePairSearch.cpp $(COMPILER) $(CXXFLAGS) -c MovePairSearch.cpp main.o: main.cpp $(COMPILER) $(CXXFLAGS) -c main.cpp In Visual Studio 2015 the compilation is normal. I must recompile BOINC libraries under MinGW? |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
Hi, Please use MinGW compiler shipped with Cygwin. Makefiles created my me automatically uses it when you pass option MinGW=1 when calling make. BTW, tou will need to install both 32 and 64-bit versions of Cygwin (in separate directories), as 64-bit version does not have 32-bit libs needed for linking final app. You will also need to recompile BOINC libs as you wrote. This is a bit tricky, as you have to use boinc/lib/Makefile.mingw instead of one generated by configure script. You can also use my ones, you can download them from here: https://bitbucket.org/sirzooro/boinc-stuff/downloads/. |
©2024 The searchers team, Karelian Research Center of the Russian Academy of Sciences