/*******************************************************************
*
*  DESCRIPTION: Class JackyDebugStream
*
*  AUTHOR: Qi (Jacky) Liu
*
*  EMAIL: mailto://liuqi@sce.carleton.ca
*
*  DATE: Sept, 2005
*
*******************************************************************/

#ifndef __JACKY_DEBUG_STREAM_H
#define __JACKY_DEBUG_STREAM_H

/** include files **/
#include
#include

//define this macro to switch on jacky-debug-mode
//#define JACKY_DEBUG

//define this to switch to the new version of NC, FC, ParallelRoot, ParallelProcessor
//also for revision done in the kernel (like outputQ, secondary rollback, event compare function,
//etc.), for revision done in the model state hierarchy at the CD++ level (to correctly save model
//states)
#define JACKY_REVISION

//define this macro to generate the debugging files for inputQ details
//#define JACKY_PRINT_INPUTQ

//define this macro to solve the synchronization problem in TD & ID cells
#define JACKY_SYNC

//define this macro to solve the rollback problem of FileQueue
#define JACKY_FILEQ

//define this macro to rollback locally defined variables in processors
//(function rollbackProcessorVariables() in TimeWarp, ParallelProcessor, and ParallelNodeCoordinator)
//The AtomicCellState also makes a copy of the state variables defined in this class in the operator= function
#define JACKY_STATE

//define this macro to solve exceptional situations during rollback
//check detailed description in ParallelNodeCoordinator::receive( const BasicExternalMessage *msg)
//"Jacky Note: Nov. 2, 2005" in pncoord.cpp
//1. flag "skipStateSaving" in TimeWarp, TimeWarp::executeSimulation()
//2. removeStragglerEvent(BasicEvent *toRemove, int id) in LTSFInputQueue
//3. peekNextEventToBeExecuted(), hasUnprocessedCollectMessageAt() in ParallelNodeCoordinator (OBSOLETE!)
// Nov. 5, 2005
//Note: flag "skipStateSaving" & removeStragglerEvent() are used along with macro JACKY_RB_EVENT by the
//      NodeCoordinator for One-Straggler / Multiple-Straggler conditions.
#define JACKY_RB_EXCEPTION

//define this macro to solve rolling back to "time before zero" problem
//1. TimeWarp::findPositionFromState()
//2. StateManager::restoreState(VTime restoreTime)
//3. setCurrentToHead() in SortedListOfEvents
//4. ParallelMainSimulator::afterInitialize() is not invoked in ParallelMainSimulator::run().
//   It is now called from within LogicalProcess::allRegistered()
//5. ParallelMainSimulator::afterInitialize() is made PUBLIC in ParallelMainSimulator
//Nov. 7, 2005
//#define JACKY_RB_ZERO

//define this macro to solve rollback problems in NC::receive(DoneMessage) [One-Straggler / Multiple-Straggler]
//1. sendMinTimeEventsToFC(), sendMinTimeNCBagMsgsToFC(), findUnprocessedCollectORInternalMessageAfter()
//   in ParallelNodeCoordinator
//2. new version of ParallelNodeCoordinator::receive(const DoneMessage &msg)
//3. findAllStateWithOutputPos() in StateManager
#define JACKY_RB_EVENT

//define this macro to solve the problem of "event-jump" and "dummy" state
//1. add flag "externalEventFlag" in message.h, along with function isEvent() and new constructor
//2. add flag "TWExternalEventFlag" in pmessage.h, along with function derivedFromExternalEvent() and constructor
//3. set isEvent() in ParallelNodeCoordinator::sendMinTimeEventsToFC() and sendMinTimeNCBagMsgsToFC
//Nov. 22, 2005
#define JACKY_EVENT_JUMP

//define this macro to skip state saving after processing an (X) message in ParallelNodeCoordinator,
//ParallelFlatCoordinator, and ParallelProcessor
//The idea is that we do not save the current state onto the stateQ after processing an (X) message,
//which will put the (X) message into the NCMessageBag or MessageBag. When we process an (X) message,
//we do not change any data in the current state, except that the lVT time is updated to the recvTime
//of the received (X) message by the kernel automatically.
//Therefore, the state saved after processing
//an (X) message will have the same content as those previous states (saved after processing a (D), (*),
//(I), or (@) message) in terms of outputPos, lastChange/nextChange, etc. Skipping state-saving after
//processing these (X) messages can give us the following benefits:
//1. reduced memory requirements
//2. faster execution for rolling back stateQs
//3. help in "Multiple-Straggler" situations for the NodeCoordinator
//4. avoid inconsistent data in the saved state: Since the receive(X) functions will not update the last
//   and next change time, the saved state may have different lVT time and lastChange time. For example,
//   after processing an (X) message with recvTime = 3000 by the FC at time 2000, the saved state will
//   have lVT = 3000, but lastChange = 2000!
//Nov. 23, 2005
#define JACKY_X_SKIP_SAVING

//define this macro to ensure that portInTransition function has a higher priority than localTransition
//function in the TD/ID cell model. (modification in tdcell.cpp / idcell.cpp)
//The problem happens when the simulation is run on multiple machines. There are potentially multiple
//rounds of message passing at a given simulation time. In this case, the value based on the portInTransition
//function may be overwritten (changed back) by the value evaluated from lagging messages from the cell's
//neighbors, which, in effect, offsets the influence of the external events and may result in less output
//from the simulation.
//Ex:
//At time 1000, the state (value) of a cell is 1. There is an input coming in with a value of 1 at this time.
//That is, this input will schedule an output of 1 at time 2000 (suppose that the default delay is 1000 time
//units and the default portInTransition is used). However, at the same time (1000), the value evaluated from
//the local transition function using the current neighborhood values is 0, i.e.
//the cell should schedule a
//state change from 1 to 0 (thus output 0) at time 2000 as well. In this case, the cell should output 1 instead
//of 0 at time 2000.
//Solution:
//A flag should be added to the TransientValue to indicate whether the scheduled output is derived from an external
//event (input to the cell) or from neighborhood values. If the flag is TRUE, i.e. derived from input, we should
//not overwrite it in later rounds.
// [2006-01-06]
#define JACKY_PORTIN_FIRST

//define this macro to solve the problem of using multiple state variables in TD/ID cells in distributed simulation.
//Multiple state variables can be defined in the TD or ID cells as a StateVars object, which is a map
//defined in atomcell.h. Since the multiple-round message passing mechanism exists in the distributed simulation at
//any given simulation time, and during each of these rounds, one of the local transition rules will be evaluated
//(and so will the associated {state variable clause} in that rule), errors in the values of these state variables can
//accumulate!
//Ex:
//Initial cell space is:
//  0 0 0
//  1 0 0
//  1 0 0
//The local transition rules defined in the MA file are:
//rule : 1 {$var := $var + 1; } 100 { (0,0) = 1 and (trueCount = 3 or trueCount = 4) and even($var) } [RULE 1]
//rule : 1 {$var := $var + 1; } 100 { (0,0) = 0 and trueCount = 2 and odd($var) } [RULE 2]
//rule : 0 {$var := $var + 1; } 100 { t } [RULE 3]
//State variable "var" has an initial value of 1.
//In the cell space, cell(0,0) has value 0 and the trueCount is 2.
//In Round 0, all neighbor values from cell(0,0)'s LOCAL neighbors have arrived at the cell. Thus, for cell(0,0), the
//trueCount = 0, it evaluates RULE 3, and the resulting cell value is still 0, but var is increased by 1 in RULE 3, so
//var = 2 after Round 0; Then, we suppose all values from remote neighbors have arrived at the cell in Round 1. So the
//trueCount = 2, which is the correct and final result for this time.
//Therefore, cell(0,0) should evaluate RULE 2 in
//Round 1. However, var is now 2, which is an even number (due to the increment done in Round 0). Hence, it evaluates
//RULE 3 again rather than correctly evaluating RULE 2!
//Solution:
//At the beginning of Round 0, we need to save all the values of state variables, and then at the beginning of each
//following round, we need to reset the state variables to their initial values saved at the beginning of Round 0.
//Doing so will clear any wrong calculation that has been done in the previous round, which makes sure that we can
//always use the initial values of state variables for the current time and the latest values from neighbors.
// [2006-01-10]
#define JACKY_STATE_VAR

//define this macro to modify the logic of externalFunction() in the TransportDelayCell class
//Previously, we added logic for "change further" [Queue replacement] & "change back" [Queue removal] to adapt to the
//multiple-round message passing scenarios in distributed simulation. The replacement / removal actions search the
//Queue for an element (scheduled internal event) that has the same "output time" [the time when the internal event
//on the Queue should be sent out] & the same "output port" [the output port on the cell model through which the event
//will be sent]. This is not accurate because the multiple-round message passing happens at a given simulated time.
//Therefore, it is the "schedule time" [the time when the internal event is inserted onto the Queue] that should be
//the same throughout the multiple rounds.
//Ex:
//According to the previous logic, at time 100 the cell schedules an internal event on the queue with an output time of
//200. That is, the cell inserts the event at time 100 and the event will be output (by the outputFunction) and erased
//(by the internalFunction) at time 200. However, further external events could arrive before time 200.
//Say, an external
//event arrives at the cell at time 150, and creates another internal event that has an output time of 210, i.e. this
//new internal event has a schedule time of 150 and an output time of 210. This new event should be inserted onto the
//queue according to the semantics of the transport delay. However, the logic will identify this as a "change further"
//since N != P & P != T & N != T, and try to do a replacement. Of course, we know the replacement action will fail since
//no element on the Queue has an output time of 210. (it intends to replace the element with an output time of 200).
//Solution:
//Add a field for each element on the Queue to record its "schedule time". So the queue element has a structure like this:
//   pair< VTime, pair< pair< string, Real >, VTime > >
//   i.e. pair< outputTime, pair< pair< outputPortName, outputValue >, scheduleTime > >
//In the externalFunction, we take "change further" and "change back" actions based on the "outputPortName" & "scheduleTime".
//If an element with the same "outputPortName" & "scheduleTime" cannot be found, we will insert the internal event on
//the Queue. In this case, we know that the schedule time of this event has changed, so it is not a multiple-round
//message passing scenario. Rather, it is a new event scheduled at a different time and should be output anyway.
// [2006-01-11]
#define JACKY_TDCELL

//define this macro to solve the problem in TDCell
//Problem is: On multiple machines & variable delays, the simulation result is different from that on a single machine!
//So far, on a single machine, the result is the same as Lopez's version.
#define JACKY_IDCELL

//define this macro to include the log file names for the NCs and FCs on all machines in the index log file
//see ParallelMainSimulator::showLogInfo()
#define JACKY_LOG_FILE

//define this macro to debug the drawlog utility, just generating debug information
//see drawlog.cpp
//#define JACKY_DRAWLOG

//define this macro to modify the cout "GVT: XX:XX:XX:XXX" message shown on the console
#define JACKY_SHOWGVT

//define this macro to avoid the "rollback to zero" problem.
//This problem is hard to solve by the previous mechanism (JACKY_RB_ZERO), i.e. give the InitialState
//enough information to restart simulation on that machine and simultaneously avoid "rollback thrashing".
//
//1. LP::synchronizeInitialization() is added to the LP [LogicalProcess.hh/cc]
//   -> calls CommMgrInterface::barrierSynchronize() [CommMgrInterface.hh/cc]
//   -> calls CommPhyInterface::physicalBarrier()=0 [CommPhyInterface.hh/cc]
//   -> implemented in CommPhyMPI::physicalBarrier [CommPhyMPI]
//   -> invoke MPI_Barrier(MPI_COMM_WORLD)
//   Using the MPI Barrier to synchronize all LPs at a special execution point!
//
//2. when the (D) message corresponding to the (@) message at time 0 is received at the NC, it calls
//   synchronizeInitialization() in its receive(DoneMessage).
//   At this time, outputFunction() in all simulators has been called, and (Y) messages have been sent out.
//   NCs should have received these (Y) messages from other remote NCs. If there are still some lagging (Y)
//   messages with time 0 in the network, the NCs have a chance to get them in the following loops for sending
//   and processing the (*) message.
//
//In this way, we can make sure that all remote (Y) messages will be received by all destination NCs at LVT 0,
//thus preventing the problem of "rollback to zero"!
#define JACKY_SYNC_TIME0

//define this macro to solve the "multiple-straggler" problem of the NC.
//After the rollbacks in the "multiple-straggler" situation, the NC is restored to the previous time when the previous
//(D) message is received from the FC. Before the NC resends (X) messages in the NCMessageBag, it needs to advance the
//LVT in its current state, which is now a copy of the previous state, to the time of the (X) messages in the NCMessageBag.
//
//This creates a "BreakPoint" in the NC's stateQ!
//
//Later on, if an anti-message for these stragglers in NCMessageBag is received in the NC, the last (D) message
//with recvTime < time of the anti-message needs to be unprocessed!!! This will not be done by the TimeWarp kernel since
//the recvTime of the last (D) message is less than the recvTime of the anti-messages, and that (D) message has already
//been processed.
//
//Solution:
//1. NC's receive(DoneMessage) ->
//   a) advance the LVT in the current state to the time of the (X) messages in NCBag before resending these messages
//   b) mark the state to be saved after resending (X) and (*)/(@) messages to be a "break-point" state, set the flag
//      "breakpoint" to TRUE in TimeWarpBasicState class
//
//2. TimeWarpBasicState.hh/.cc ->
//   a) flag "breakpoint" is added to TimeWarpBasicState class.
//      This flag is to mark a state as a break-point on the NC's stateQ. By default, this flag is false.
//      When multiple-straggler happens in NC's receive(Done), NC will set this flag in its current state to true.
//      Then this current state will be saved by the StateManager.
//      This state is a break point because: it has lVT = the time of (X) straggler messages in NCBag >
//      the recvTime of the DoneMessage pointed by its inputPos. Also, its outputPos points to the (X) and
//      (*)/(@) messages resent by the NC with recvTime = the time of (X) straggler messages in NCBag;
//      its inputPos points to the previous (D) message received by the NC BEFORE the rollbacks since
//      the rollbacks have restored the state to this one.
//   b) All copy and assignment functions in TimeWarpBasicState.cc are modified to copy this added flag.
//
//3. StateManager.cc ->
//   a) saveState() :
//      this function is modified to reset the "breakpoint" flag in the current state of the NC to FALSE
//      The flag in NC's current state is marked to TRUE in its receive(D), after saving the current state
//      onto the stateQ, the StateManager will reset this flag in the NC's current state to FALSE so that
//      following states saved on the stateQ will not be marked as "breakpoint"
//   b) removeStatesAfterCurrent() :
//      this function will remove all states with LVT >= rollbackTime during the rollback mechanism. However,
//      we need to check whether the 1st state to be removed is a breakpoint state. If that is the case, special
//      operations need to be done before we remove it. These operations include:
//      i> check whether this is the NC's stateQ. Operations should only be done for the NC. Only NC may have
//         break-point states on its stateQ.
//      ii> if the 1st state to be removed is a break-point state on the NC's stateQ, we need to unprocess the
//         DoneMessage (sent from the FC to this NC at the previous time) pointed by the "inputPos" of this
//         break-point state.
//      iii> we need to adjust the currentObj[ncLocalId] and currentPos of the inputQ so that this unprocessed
//         DoneMessage will be executed just after the rollbacks by the LTSFScheduler!
//         These operations are done by LTSFInputQueue::reExecuteImmediately()
//
//[2006-02-02] "Break-Point State on NC's StateQ"
#define JACKY_NC_BP_STATE

//define this macro to solve the problem of having messages with different recvTime in the NCBag during Multiple-Straggler
//situations. This is to solve the Multiple-Straggler problem along with JACKY_NC_BP_STATE, JACKY_RB_EVENT.
//
//If after resending the (X) NCBag messages, there are still messages in the NCBag, NC needs to clear these messages
//from the NCBag and unprocess these (X) events on the InputQueue.
//To do this, we must keep track of the address of the BasicEvents on the inputQ from which those (X) messages in the
//NCBag are derived. Note: the address of the (X) messages in the NCBag is NOT the address of the original BasicEvent!
//The (X) messages are created by getMessage(), which extracts information from the BasicEvent and allocates new (X)
//message objects!
//1. A multimap is used in the NC to keep track of the address of the BasicEvents. Elements in this NCBagReference are
//   in 1-to-1 correspondence with the (X) messages in the NCBag. That is, if there is an (X) message in the NCBag with
//   recvTime of 1000, there is an element pair <1000, BasicEvent*> in the NCBagReference for this (X) message. NCBagReference
//   elements are inserted in ParallelProcessor::executeProcess(). In NC, whenever we remove (X) messages in the NCBag,
//   we also remove the corresponding pairs in the NCBagReference.
//2. NC::receive(DoneMessage) will check whether the NCBag is empty after resending all (X) messages with the minTime
//   during multiple-straggler situation. If the NCBag is not empty, we will do:
//   a) call inputQ.resetLVTArrayTime(minTime, ncLocalId) to reset the lVTArray[ncLocalId] to the current minTime
//      so that the LVT of the NC is recovered to the time after processing the straggler (X) messages with the minTime.
//   b) process the remaining (X) messages in the NCBag grouped by their recvTime. Get the corresponding BasicEvent*
//      from the NCBagReference. Call inputQ.unprocessXEvent(pxEvent, ncLocalId) to unprocess the BasicEvent, then
//      remove the (X) messages and the elements from the NCBag & NCBagReference respectively.
//At the end, the NCBag & NCBagReference will be empty!
//Summary: NC inserts a new time frame for the minTime, but it can only take care of ONE insertion at a time. Other
//         time frames indicated by the remaining (X) messages in the NCBag with recvTime > minTime will be cleared.
//         These future time frames will be taken care of later.
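/* Illustrative sketch only (not part of the simulator sources): the NCBagReference bookkeeping
   described above could look roughly like this. SketchBasicEvent, SketchBagReference,
   dropResent() and drainRemaining() are hypothetical names invented for this sketch. The real
   code works on BasicEvent* and calls inputQ.unprocessXEvent(); here that step is only mimicked
   by returning the pointers that would have to be unprocessed.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-in for a BasicEvent living on the inputQ.
struct SketchBasicEvent {
    int id;
};

// Assumed shape of the reference table: recvTime -> address of the originating event.
// One entry per (X) message held in the bag.
using SketchBagReference = std::multimap<long, SketchBasicEvent*>;

// The (X) messages with recvTime == minTime were resent, so their references are dropped.
// Returns how many entries were removed.
std::size_t dropResent(SketchBagReference& ref, long minTime) {
    return ref.erase(minTime);
}

// Whatever is still in the table belongs to later time frames. Collect those events in
// recvTime order (the caller would unprocess them) and leave the table empty.
std::vector<SketchBasicEvent*> drainRemaining(SketchBagReference& ref) {
    std::vector<SketchBasicEvent*> toUnprocess;
    for (const auto& entry : ref) {
        toUnprocess.push_back(entry.second);
    }
    ref.clear();
    return toUnprocess;
}
```

   After dropResent() and drainRemaining() have both run, the table is empty, which matches the
   "NCBag & NCBagReference will be empty" condition stated above.
*/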
//[2006-02-03]
#define JACKY_NCBAG_POINTERS

//define this macro to directly get the state we want to delete from the tail of the NC's stateQ.
//The original algorithm in the NC's multiple-straggler situation is like this:
// 1. remove all straggler (X) messages the NC has sent out at the minTime
// 2. find the futureCIEvent
// 3. from this futureCIEvent, find all states on NC's stateQ with outputPos = futureCIEvent
//    There should be only one state with outputPos = futureCIEvent, and this state should be the tail of the stateQ
// 4. find all out-going messages that the NC has sent out just before this state is saved.
// 5. find external events from these out-going messages and rollback these external events.
// 6. remove all these out-going messages from NC's outputQ and FC's inputQ
// 7. remove the state found in step 3.
//
//If this macro is defined, the state will not be found by searching the stateQ with condition outputPos = futureCIEvent.
//Rather, we simply take the tail of the NC's current stateQ as this state to be deleted, and test whether its outputPos
//is equal to the futureCIEvent or not.
//This will be more efficient!
//[2006-02-07]
#define JACKY_NC_GET_STATE_FROM_TAIL

//define this macro to solve the problem when the stateFound during multiple-straggler has been garbage
//collected and stateFound->outputPos = NULL.
//we will use a set of new functions to find the future events, their corresponding Containers on NC's outputQ,
//and the events derived from external events for rolling back the EventList
//New functions are:
// 1. list< BasicEvent* > findAllFutureEvents( const BasicEvent* futureCIEvent );
//    -> function to get all future events on the InputQ based on the futureCIEvent found
//
// 2. list< BasicEvent* > findMessagesDerivedFromExternalEventsFromBasicEvents( const list< BasicEvent* >& );
//    -> function to find the BasicEvents derived from external events from all of the future events
//
// 3.
//    list< Container* > findAllOutputMessagesForFutureEvents( const list< BasicEvent* >& );
//    -> function to find all Containers on NC's outputQ for removing them. If the outputQ is empty, this
//       function returns an empty list; otherwise, it returns all Containers on NC's outputQ
//[2006-02-13]
#define JACKY_NC_GET_OUTPUT_MSG_FROM_futureCIEvent

//define this macro to do message type-based state saving
//Previously, we only skipped state saving after processing (X) messages.
//Now, we have this skip-state-saving scheme:
// 1. NC & FC -> only save state after processing (D) messages
// 2. Simulator -> only save state after processing (*) messages
//[2006-02-23]
#define JACKY_MSG_TYPE_BASED_STATE_SAVING

//define this macro to write a small amount of debugging information to files
//#define JACKY_WRITER

//define this macro to delete (destroy) the "stateFound" in NC::receive(D) in Multiple-Straggler situation
//[2006-03-27]
#define JACKY_DESTROY_STATEFOUND

//define this macro to delete (destroy) the messages created for debugging purpose
//[2006-04-14]
#define JACKY_DESTROY_MESSAGE

//define this macro to collect statistics about the number of messages ([+] & [-]) during the whole simulation on
//a specific LP.
// 1. add 6 "unsigned long" in LTSFInputQueue, namely:
//    positiveMsgNumber -> total number of [+] messages received
//    negativeMsgNumber -> total number of [-] messages received (number of implosions)
//    rollbackNumber -> total number of rollbacks happened
//    positiveStragglerMsgNumber -> total number of [+] straggler messages that cause rollbacks
//    negativeStragglerMsgNumber -> total number of [-] messages that cause rollbacks
//    totalNumberOfEventsRolledBack -> total number of events unprocessed during rollbacks
//
// 2. these variables are updated in function LTSFInputQueue::insert() and miniListUnprocessMessages()
// 3. add 6 functions in LTSFScheduler to retrieve these variables in the inputQ
// 4.
//    at the end of the simulation, each LP will get the values of these variables from the LTSFScheduler
//    and print them to stdout.
//[2006-03-31]
#define JACKY_COUNT_MESSAGE_NUMBER

//define this macro to collect statistics about the number of states saved/skipped, the number of events executed/
//coasted-forward during the whole simulation on a specific LP.
// Define 7 static data members in class BasicTimeWarp [see BasicTimeWarp.hh for explanations]
// 1. totalNumberOfStatesSaved -> StateManager::saveState()
// 2. totalNumberOfStatesSkipped -> TimeWarp::executeSimulation() [skipStateSaving == true]
// 3. totalNumberOfEventsExecuted -> TimeWarp::executeSimulation()
// 4. totalNumberOfEventsCoastedForward -> TimeWarp::coastForward()
//
// 5. totalTimeForStateSaving (nano-seconds) -> StateManager::saveState()
// 6. totalTimeForEventExecution (nano-seconds) -> TimeWarp::simulate()
// 7. totalTimeForCoastForward (nano-seconds) -> TimeWarp::rollback()
// 8. totalTimeForRollback (nano-seconds) -> TimeWarp::recvEvent()
//    Note: total rollback time includes the time spent on coast forward operations!
//    That is, time 8> includes time 7>
// 9. initializationTime -> ParallelMainSimulator::run()
// 10. dormantTime -> LogicalProcess::simulate()
//
// 11. jackyWatch -> stopwatch for measuring 5, 6, 7 time
// 12. jackyRBWatch -> stopwatch for measuring 8 time
// 13. initializationWatch -> stopwatch for measuring 9 time
// 14. dormantWatch -> stopwatch for measuring 10 time
//
// 15. totalNumberOfLazyHit -> TimeWarp::lazyCancel()
// 16. totalNumberOfLazyMiss -> TimeWarp::lazyCancel()
// 17. msgSize -> TWExternalMessage::getTWExternalMessageSize()
//
// Note: The original approach for measuring rollback time in Warped kernel is not correct! Our new approach is shown
//       in TimeWarp::recvEvent() Jacky Note [2006-04-05]
//[2006-04-02]
#define JACKY_STATISTICS

//define this macro for running simulations on a single machine.
//[Danger!]
//If this macro is turned on, NO state will be saved throughout the simulation, and NO outgoing [+] message
//will be saved on the outputQ
//See TimeWarp::executeSimulation() & TimeWarp::sendEventUnconditionally()
//[2006-04-06]
// #define JACKY_SINGLE_MACHINE //only for single node!!!

//define this macro to fix the contents of the NC's outFileQ during multiple-straggler situation.
//Here is the problem:
//After resending the (X) messages in the NCBag that represent the 1st pending TimeSlice, the NC is going to remove &
//undo the remaining (X) messages in the NCBag. For example, there are 2 remaining TS in the NCBag as follows:
//   t0 t1 t2 t3 t4
//   [FC (D) received]
//   [NC speculative computation gets t4 as minTime]
//Then, when a (D) from the FC with time t4 is received at the NC, the NC sends a straggler NC@t4-X->FC@t1 to the FC,
//which triggers RB on all the processors on this node. The rollback times for these processors are:
//   rollbackTime for FC = t1
//   rollbackTime for NC, Simulators & Root = t4
//Since there were no messages received by the FC, the Root, and all the Simulators between t0 and t4, a rollback time
//of t1 or t4 makes no difference for the FC, the Root, and the Simulators in terms of their outFileQueues.
//The outFileQs in the FC, the Root, and the Simulators will be rolled back to the end of t0. That is, all FileData
//with time > t0 are removed from the outFileQs of the FC, the Root, and the Simulators. Therefore, the contents of
//the outFileQs in the FC, the Root, and the Simulators are recovered to what was saved at the end of t0.
//
//However, the rollbackTime for the NC is t4, so FileData with time >= t4 in the NC's outFileQ are removed. What is
//left in the NC's outFileQ is:
// a) log for messages received before (including) t0, i.e.
//    in time period [0, t0]
// b) log for messages received in period (t0, t4)
//When we remove pending TS of time t2 and t3 from NCBag and undo these (X) messages, we must also remove the log data
//for t2 and t3 from the NC's outFileQ!!!
//Define this macro for fixing the NC's outFileQ in this situation.
//[2006-04-26]
// Jacky Note: This is the rationale for creating the single log file for the NC (rather than the FC) on each
//             node when "One Log File per Node" strategy is used. We cannot use the log file for the FC as the
//             single log file on this node since the log FileData for the (X) received by the NC at time t1
//             will be deleted (remember the rollbackTime for the FC is t1)!
//             If the single log file is created in the NC, all FileData with time >= t4 are removed during the
//             rollback. This is OK for the FC, the Root, and the Simulators since, as we have explained, a rollback
//             time of t1 or t4 makes no difference for them in terms of their outFileQueues. Also, this is OK for
//             the NC as well since the NC's rollbackTime is t4. Log FileData for all the pending TS (t1, t2, and t3)
//             are left, and then data at time t2 & t3 are removed by the NC after the rollback.
#define JACKY_NC_MULTI_STRAGGLER_FILEQ

//------------------------------------------------------------------------------------------------------------------------
//Note: These two macros should be set in ONE of the following ways:
// 1. unset both
//    -> this will be the original version of creating log & output files
// 2. set JACKY_SINGLE_LOG_FILE_LY only
//    -> the original version enhanced by -LY option (only create log file for FCs)
// 3. set JACKY_UNIQUE_LOG only
//    -> new version of creating single log file per node
//       This new version includes the enhancements done in point 2. So there is no
//       need to set both macros at the same time!

//define this macro to reduce the overhead for creating the log files
//if -LY is given on the command line, we only create log files for the FCs.
//Only one log file will be created
//on each machine. This is enough for the drawlog utility!
//That is, if we have -l(logfileName) -LY, then only create a single log file for the FC to log all (Y) messages received.
//[2006-04-20]
//#define JACKY_SINGLE_LOG_FILE_LY

//define this macro to implement the "One Log File per Node" strategy. We only create a single log file on each node!
//This will improve the performance greatly!
//Here is the setting for the command line options:
// 1. if -lLogFileName is given
//    -> only create a single log file for the NC, ALL processors on the node write log to it.
//
// 2. if both -lLogFileName and -LY are given
//    -> only create a single log file for the FC and only (Y) messages received by the FC are logged.
//
// 3. if -lLogFileName is given along with other -L options
//    -> only create a single log file for the NC, all processors on the node write log to it.
//
//Note: The drawlog utility checks the "index log file" for [logfiles] group, and then searches a definition for
//      the log files that it should use. Previously, the definition name is "ParallelFlatCoordinator", i.e. all
//      log files created for the FCs are used to draw the space. Now, we change the definition name to "LogFileNames",
//      which can be the log files created for the NCs or for the FCs depending on the command options.
//
// a) ParallelMainSimulator::showLogInfo() -> index log file only shows the log files created in the following form:
//    [logfiles]
//    LogFileNames : name.logNC/FCid1 name.logNC/FCid2 ...
// b) ParallelProcessor::initialize() -> if log type = 8, already done
//    if log type != 8, only create log file for the NC on this node
// c) ParallelRoot::initialize() -> should not create log file altogether!
//    only create the output file if necessary
//[2006-04-27]
#define JACKY_UNIQUE_LOG
//-----------------------------------------------------------------------------------------------------------------------

/****************************************************************************************************************
** Following are macros for debugging optimization algo. in the Time Warp kernel
*****************************************************************************************************************/

//define the macro to debug ONE_ANTI_MESSAGE optimization
//A SimuObj sends only ONE [-] message to each distinct destination SimuObj
//On the receiver side of the [-] message, the SimuObj deletes all [+] messages on its inputQ where:
//   msg->sender = [-]msg.sender &&
//   msg->sendTime >= [-]msg.sendTime
//i.e. delete all messages that come from the same source and have a sendTime no earlier than that of the [-] message
//Note: originally, (msg->sendTime >= [-]msg.sendTime) is not checked, but this is necessary since in the following
//scenario, we want to unprocess the msgs rather than deleting them.
//NC@100 -@-> FC@200 (1)
//now, [-]msg is NC@200 -X-> FC@200 (2)
//when the FC gets msg (2), it will delete all msgs that come from the NC and have sendTime >= 200. Also, the FC needs to
//unprocess msg (1) rather than deleting it!
//
//Note: this macro should only be used within the scope of "ONE_ANTI_MESSAGE" defined in config.hh
//[2006-03-06]
//#define JACKY_ONE_ANTI_MESSAGE

//debugging InfreqStateManager
//Problems:
// 1. The initial state is saved for all events at time 0. So if later we roll back to the end of time 000,
//    the initial state will be restored. This does not work!
//     (This is due to the same reason that we have to use an MPI Barrier after the Collect Phase at time 0.)
//     --> the initial state is saved in function LP::allRegistered()
//           a> timeWarpInit() => Processor's initialize() is called
//           b> saveState()    => save the initial state
//     Solution:
//       1> Add function resetTimeAtLastCallAndPeriodCounter() in class InfreqStateManager
//          -> This function resets "timeAtLastCall" to INVALIDTIME (-1) and periodCounter to 0.
//             Thus, the state derived from the 1st event at time 0 will be saved!
//             Also add functions getTimeAtLastCall() & getPeriodCounter() in InfreqStateManager.hh
//
//       2> Call simArray[i].ptr->state->resetTimeAtLastCallAndPeriodCounter() after saving the initial states
//          in LogicalProcess.cc.
//
//  2. The original TW::coastForward() function supposes that we always save the 1st state for a TimeSlice, but this is
//     not the case for msg type-based state saving. For the msg type-based state saving scheme, the state saved is the
//     one after processing the 1st event with a specific message type. For example, the state saved for the NC in a Time
//     Slice is the one after processing the 1st (D) message from the FC; same thing for the FC. For Simulators,
//     the state saved is the one after processing the 1st (*) message. Thus, multiple events may have
//     been executed in a time slice before the state is saved.
//     TW::coastForward() calls find() to get the 1st event on the inputQ with recvTime >= timeRestored, and then
//     skips the 1st event from the resulting findPos of the inputQ miniList for the simuObj in question, based on
//     the assumption that the restored state was saved after processing the 1st event at this recvTime.
//     This is wrong when used with the message type-based state saving scheme.
//     In general, we should skip all events with recvTime = timeRestored that lie before the current state's inputPos!
//     Note: after calling state->restoreState(), the current state is a copy of the saved state on the stateQ.
//     Its inputPos points to the event for which the restored state was saved, so we can get the inputPos directly
//     from the current state. There is NO need to search the inputQ and skip one event. This is more efficient.
//     --> TimeWarp::coastForward() is modified to use this method to get the 1st event (the event just behind
//         current->inputPos) for coasting forward.
//
//  3. There are processor variables that are not put in the states, such as NCMessageBag and MessageBag. These Bags
//     contain pointers to Message objects derived from (X) messages (BasicEvents) on the inputQ. These Message objs
//     will be deleted after being sent out as (X) messages to receivers downstream. For the coast forward phase, we
//     need to recover the status of the Bags so that they contain the same Message objects as in the original
//     execution! This is important for the FC since we now roll back into a TimeSlice (previously Time Slices
//     were the atomic unit in terms of rollbacks; we only rolled back to the END of a Time Slice), and the FC's Bag may
//     contain some (X) messages. For Simulators, we always restore to the state that was saved after
//     processing a (*) message at the simulator, and after processing a (*) message the Simulator's MessageBag is
//     always cleaned. So for Simulators, we only need to clean the MessageBag before doing coast forward.
//     NodeCoordinators will use InfreqStateManager as a normal StateManager and will not do coast forward during
//     rollbacks, thus we don't need to do anything to recover their processor variables.
//     --> a> Define function TimeWarp::recoverProcessorVariables(); by default, it's empty!
//            -> function to recover processor variables before coast forward operations
//            FC => overrides this function to recover its MessageBag from the FCBagReference
//                  defined in its state;
//                  In order to do this, we need to define a FCBagReference in ParallelFlatCoordinatorState.
//                  It's a "bagref" just like the one for the NC.
//                  When we restore the FC's state, the FC may
//                  have some (X) msgs in its MessageBag. The restored state may have been saved for the FC in the
//                  following situations:
//                  i>  If there is a Collect Phase -> the FC saves state after processing the 1st (D) msg
//                                                     received from its 1st imminent cell.
//                      --> Its MessageBag may contain (X) msgs if there were (X) messages sent with
//                          the (@) message from the NC. In this case, the FC first saves the (X) msgs
//                          in its MessageBag and then sends (@) to its dependants. The (X) messages
//                          are sent and cleaned when the (*) msg is received from the NC at [R0]
//                  ii> If there is no Collect Phase -> the FC saves state after processing the 1st (D) msg
//                                                      received from its 1st cell that receives the (*)
//                                                      msg.
//                      --> If the NC sends (X) & (*) to the FC at [R0], then these (X) msgs will be
//                          sent to the cells & cleaned from the FC's MessageBag. In this case, there
//                          is no (X) msg in the FC's MessageBag when the state is saved.
//            Simulator => Before CoastForward, we need to recover its MessageBag to its status right after
//                         the restored state was saved. We know that simulators save state for a given time only after
//                         they receive the 1st (*) msg from the FC. This will always happen in [R0] of the
//                         transition phase. The simulator's internal/external function is called, and its
//                         MessageBag is cleaned. Thus, its MessageBag should always be EMPTY when the state
//                         is saved. Therefore, cleaning the MessageBag of every Simulator before coast forward should
//                         be enough, so a Simulator always cleans its MessageBag before coasting forward;
//            NC => does not provide an implementation for this function; it inherits the empty function
//                  definition from the TimeWarp class.
//
//         b> Define function TimeWarp::getSuppressMessage()
//            -> function to get the "suppressMessage" flag in the TW object so that functions at the PCD++ level can do
//               different things during coasting forward & normal processing
//            The NC uses this function to avoid MPI_Barriers during coast forward; ParallelProcessor uses this function
//            to skip writing to the log files during coast forward; ParallelRoot uses this function to skip writing
//            to the output files during coast forward!
//
//  4. The MPI_Barrier in NC::receive(D) should not be called during coast forward (CF) since in the CF phase no
//     message will be sent out! The NC can find out whether it is in a CF phase using the TimeWarp::getSuppressMessage()
//     function.
//
//  5. We should not write log lines into log files during coasting forward!
//     --> ParallelProcessor::writelog() checks getSuppressMessage(); if it returns 2 (i.e. in the CF phase), it does not
//         insert data into the fileQ
//
//  6. The Root Coordinator should not write lines into the output files during coasting forward!
//     --> ParallelRoot::receive(const BasicOutputMessage *msg) only writes to output files when getSuppressMessage()
//         is not 2!
//
//  7. The NC's states are strategic for the simulation on that node! We should not skip any NC state except those under
//     the condition of the Message Type-based State Saving Strategy!
//     --> define flag "doStateSaving" in InfreqStateManager, and modify its saveState() so that if this flag is true
//         the InfreqStateManager acts just like a normal StateManager.
//         When the NC is created, it sets:
//           InfreqStateManager::doStateSaving = true => so that the NC can use its InfreqStateManager
//                                                       just like a StateManager
//           InfreqStateManager::statePeriod = -1     => so that the NC doesn't do coast forward during rollbacks
//         Thus, we have a hybrid two-level strategy for state saving!
//
//  8.
//     GVTManager::gcollect() function should not do fossil collection based on the gVT time since gVT does not take
//     coast-forward operations into account. GVT calculation can only ensure that any message ([+]/[-]) that we may
//     receive will have a recvTime >= current gVT. But, if we need to roll back upon reception of a [+] straggler or
//     [-] anti-message with recvTime = gVT, we will restore to a state saved before the current gVT, and the resulting
//     coast-forward operations need to re-execute the events between the time of the restored state and the gVT!
//     Strategy A) =>
//         Do fossil collection using the MIN time among all states that might be restored if we receive a
//         [+] straggler or [-] anti-message with recvTime = gVT. That is, we find the LAST states saved
//         before the current gVT on all local simuObjs' stateQueues. Use an array of VTime for all local
//         simuObjs to hold their LAST states' lVT (VTime* lastStateLVT). The MIN time among these states'
//         lVT should be the time we use to do fossil collection.
//         1> If the gVT is INFINITY (...), we should skip this calculation of MIN and do fossil collection
//            using the INFINITY gVT directly since the whole simulation has finished.
//         2> If the gVT is Zero, we should not do fossil collection at all because i) we cannot find a state
//            saved before time Zero and ii) there is no need to fossil collect the inputQ/outputQ/stateQ at
//            time 0.
//         --> this makes sure that if we get a straggler [+] message or a [-] message with recvTime = gVT,
//             we can always restore to the LAST state with lVT < gVT and do coast forward operations properly.
//         Problems for this strategy: Master/Slave Coordinators exist in our simulation, but they don't
//         participate in the actual simulation! In this strategy, we need to mark these Coordinators so that
//         they don't participate in the "MIN of LAST state LVT" calculation. This is done using a flag defined
//         in BasicTimeWarp.hh (skipCalculateLVT).
//         If this flag is true, then the entry in array "lastStateLVT"
//         for this simuObj is set to INFINITY, so it has no impact on the MIN calculation. Also, the
//         ParallelRoot should participate in the calculation only if the model TOP has output ports. If the
//         TOP model has no output ports, then ParallelRoot should not participate in the MIN calculation
//         either, since it will always have a LAST state LVT of ZERO!
//         Therefore, we need to skip the MIN calculation for:
//         a) All ParallelCoordinators. These are dummy processors that do not actually participate in the
//            simulation. (ParallelMCoordinator & ParallelSCoordinator)
//            --> this is done by adding a flag "skipCalculateLVT" in BasicTimeWarp, which is initialized
//                to false. In ParallelProcessorAdmin, after those coordinators have been created, we set
//                this flag to true; for all other types of processors, this flag remains false.
//                In GVTManager::gcollect(), we check this flag before invoking
//                TW::getLastStateTimeBefore(gVT). If this flag is true, the lVT value of the LAST state
//                before the current gVT is set to PINFINITY.
//         b) For ParallelRoot, if there is output to the environment, the root will receive (Y) messages
//            from the Node Coordinators during the simulation, and it saves states using the InfreqStateManager
//            policy. In this case, the ParallelRoot should participate in calculating "lowestTime" in
//            GVTManager::gcollect(). On the other hand, if there is no output to the environment, i.e. the
//            TOP model has no output port, the root will never receive a (Y) message from the node coordinators.
//            Thus, no state will be saved for the root, and it will not be involved in rollback or coast
//            forward operations. Thus, we should set its "skipCalculateLVT" flag to true in parsimu.cpp.
//
//     Strategy B) => use a different fossil collection time for each local simuObj!
//         In GVTManager::gcollect(), after filling all LAST states' lVT into array "lastStateLVT", we simply
//         use these last states' lVT times as the fossil collection times for the corresponding simuObjs!
//         Array "lastStateLVT" is initialized to ZERO, and we don't need to consider the Master/Slave/Root
//         coordinators since, if no state is saved for them, their LAST state lVT is always Zero, and thus no
//         fossil collection happens on these simuObjs, naturally!
//[2006-03-13]
//#define JACKY_INFREQ_STATEMANAGER

//define the state period for the infrequent state manager
//This macro should always be defined even when StateManager is used for compilation!!!
#define STATE_PERIOD 1

//define ONE of the following macros to use strategy A or B for the GVTManager
//set this macro ONLY when InfreqStateManager is used
//#define JACKY_FOSSIL_COLLECTION_STRATEGY_A

//Note [2006-03-27]: Strategy B may have a problem!
//For example: current gVT = 4000, Root's last state lVT = 000, and NC's last state lVT = 3000. So Root will NOT do
//fossil collection at this time. However, NC will do fossil collection for gtime = 3000 according to this strategy.
//When NC does garbage collection on its inputQ, it will remove all BasicEvent* with recvTime < 3000, and these
//BasicEvent objects are destroyed! However, this will affect the outputQ of the Root!!!
//BasicEvent "Root -I-> NC @ 000" is saved in a Container on the Root's outputQ. When NC destroys the BasicEvent in
//this Container, Root will have an invalid element on its outputQ!
//We can only use Strategy A, which is less aggressive!
//#define JACKY_FOSSIL_COLLECTION_STRATEGY_B

//define this macro to fix the Lazy-Cancellation Strategy
//In the original algorithm, all messages sent before rollback, i.e. those messages on the outputQ after the outputPos of
//the restored state, are moved to the lazyCancelQ. These messages include those sent to 1) remote NCs, i.e. inter-LP
//messages; and 2) local simuObjs, i.e. intra-LP messages.
//After moving these messages to the lazyCancelQ, the rollback
//is deemed complete and reexecution of the unprocessed messages begins.
//
//Problem:
//The "sensitivity of output message" is high among intra-LP messages. The reexecution of those unprocessed events will
//fail since there may be false messages on the simuObj's inputQ, which can only be annihilated during the ordinary
//rollback process (the message annihilation of aggressive cancellation). Therefore, we cannot simply unprocess the
//messages on the inputQ and start reexecution immediately.
//However, the inter-LP messages are suitable for lazy cancellation. These (X) messages sent to remote NCs are
//generated when the local NC executes a (Y) message. They will be either resent during the execution after rollbacks
//on the local node or annihilated by sending [-] messages to those remote NCs.
//
//Solution:
//  1. We should implement lazy cancellation only for the inter-LP messages. That is, lazy cancellation is
//     implemented at the LP level rather than the simuObj level. Actually, only the NC needs the lazyCancelQ
//     since it is the only one in charge of inter-LP communication at the PCD++ layer. That is:
//       a) inter-LP message -> lazy cancellation (moving to lazyCancelQ)
//       b) intra-LP message -> aggressive cancellation (sending [-] message immediately)
//  2.
//     Redefine the lazyCmp() functions for each type of TWMessage in pmessage.h
//     In fact, based on point 1, we can see that only TWExternalMessage needs this lazyCmp() function
//[2006-03-28][2006-04-14]
//#define JACKY_LAZY_CANCELLATION

#ifdef JACKY_DEBUG
//Singleton giving access to the debug output stream; only compiled in jacky-debug-mode
class JackyDebugStream{
private:
	static JackyDebugStream *instance;
	ostream* jackyDebugStream;
	JackyDebugStream();	//private: obtain the object through Instance()

public:
	virtual ~JackyDebugStream();
	static JackyDebugStream &Instance();
	ostream &Stream();
};
#endif

#ifdef JACKY_WRITER
//Singleton giving access to the writer output stream; only compiled when JACKY_WRITER is defined
class JackyWriter{
private:
	static JackyWriter *instance;
	ostream* jackyWriter;
	JackyWriter();	//private: obtain the object through Instance()

public:
	virtual ~JackyWriter();
	static JackyWriter &Instance();
	ostream &Stream();
};
#endif

#endif	//__JACKY_DEBUG_STREAM_H