Copyright (c) Hyperion Entertainment and contributors.
DMA Resource
Author
Jamie Krueger, BITbyBIT Software Group LLC
Copyright (c) 2019, 2021 Trevor Dickinson
Used by permission.
DMA Engine
Some hardware targets include a DMA engine which can be used for general purpose copying. This article describes the DMA engines available and how to use them.
Hardware Features
The Direct Memory Access (DMA) Engines in the NXP/Freescale p5020, p5040 and p1022 System on a Chip (SoC) devices, used in the AmigaONE X5000/20, X5000/40 and A1222 respectively, are quite flexible and powerful. Each of these chips contains two distinct engines with four data channels each. This provides a total of eight DMA Channels that can work at once, with up to two DMA transactions actually being executed at the same time (one on each of the two DMA Engines).
Further, each of the four DMA Channels in a DMA Engine may be individually programmed to handle a single transaction, a Chain of transactions, or even Lists of Chains of transactions. The DMA Engines automatically arbitrate control between the DMA Channels according to the bandwidth setting programmed for each Channel (typically 1024 bytes).
This means that after completing a transfer of 1024 bytes (for example), the hardware will consider switching to the next Channel to allow it to move another block of data, and so on in a round-robin fashion. If all other DMA Channels on a given DMA Engine are idle when arbitration would take place, the hardware will not arbitrate control to another Channel, but simply continue processing the transaction(s) for the Channel it is on.
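The arbitration rule described above can be modelled in a few lines of portable C. This is only an illustrative sketch of the round-robin behaviour, not the hardware's actual logic: the function name `simulate_arbitration`, the constants and the fixed 1024-byte burst are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CHANNELS    4      /* channels per DMA Engine (from the text) */
#define BANDWIDTH_BYTES 1024u  /* typical per-channel bandwidth setting */

/*
 * Model one DMA Engine draining the pending byte counts in remaining[].
 * After each burst the next busy channel (round-robin) gets control; if no
 * other channel is busy, the current channel simply keeps going.
 * finish_order[] receives the channel indices in the order they completed.
 * Returns how many times control passed from one channel to another.
 */
unsigned simulate_arbitration(uint32_t remaining[NUM_CHANNELS],
                              int finish_order[NUM_CHANNELS],
                              size_t *finished_count)
{
    unsigned switches = 0;
    int current = -1;      /* channel that moved data last, -1 = none yet */
    size_t finished = 0;

    assert(remaining != NULL && finish_order != NULL && finished_count != NULL);

    for (;;) {
        /* Search the other channels first; the current one is tried last. */
        int next = -1;
        for (int step = 1; step <= NUM_CHANNELS; step++) {
            int candidate = (current + step + NUM_CHANNELS) % NUM_CHANNELS;
            if (remaining[candidate] > 0) {
                next = candidate;
                break;
            }
        }
        if (next < 0)
            break;         /* every channel is idle */

        if (current >= 0 && next != current)
            switches++;
        current = next;

        /* Move at most one bandwidth quantum for this channel. */
        uint32_t burst = remaining[current] < BANDWIDTH_BYTES
                             ? remaining[current] : BANDWIDTH_BYTES;
        remaining[current] -= burst;
        if (remaining[current] == 0)
            finish_order[finished++] = current;
    }

    *finished_count = finished;
    return switches;
}
```

With a single busy channel the model never switches; with two busy channels control alternates after every burst until one of them runs dry.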
DMA Copy Memory - Execution Flow Diagram
What a call to perform a DMA copy does internally
As shown in the above diagram, when the user makes a call to request a memory copy be performed by the DMA hardware, the next available DMA Channel is selected for use and a DMA Transaction (source, destination and size) is constructed. The DMA Transaction is then programmed into the DMA Engine which owns the available DMA Channel.
At this point the calling task will Wait() until it hears the transaction has been completed. It will then return to the caller with the result. This provides a basic blocking function, which only returns to the caller once that data has been copied. This single tasking behavior is the simplest to use and what is normally expected by most applications using a memory copy function.
Diagram of multitasking DMA Copies
How multiple simultaneous DMA copies are handled
When multiple user calls requesting a DMA copy arrive at once, each one is handed to a dedicated DMA Channel handling task for processing. As the diagram above demonstrates, there are two separate DMA Engines available, each with four channels that may be programmed at the same time. The hardware will then arbitrate the actual data move across these channels according to their respective bandwidth settings (usually 1024 bytes).
In the diagram above, a separate color indicates a distinct data path from the caller through the DMA hardware to the system RAM. A dashed line of the matching color indicates an Interrupt line signaling the respective DMA Channel Handler with the completion of the transaction. The handler task then signals back to the original caller, which returns to the user with a success or failure result.
Each of the eight DMA Channels can handle either a single block transaction or an entire chain of block transactions before it signals completion and returns to the original caller. If all eight DMA Channels are busy processing their requested transactions when further DMA copy requests arrive, each new request is assigned a DMA Channel to wait on (managed via a Mutex lock on each Channel) and blocks until it is allowed to add its DMA transaction to the Channel's queue.
fsldma.resource
The fsldma.resource API is provided automatically in the kernel for all supported machines (currently the AmigaONE X5000/20, X5000/40 and A1222).
The FslDMA API
The API provided by the fsldma.resource currently consists of:
- Copy Memory functions
Copy Memory Functions
Three API calls make up the FslDMA API: CopyMemDMA(), CopyMemDMATagList() and CopyMemDMATags().

    BOOL CopyMemDMA( CONST_APTR pSourceBuffer, APTR pDestBuffer, uint32 lBufferSize );
    BOOL CopyMemDMATags( CONST_APTR pSourceBuffer, APTR pDestBuffer, uint32 lBufferSize, uint32 TagItem1, ... );
    BOOL CopyMemDMATagList( CONST_APTR pSourceBuffer, APTR pDestBuffer, uint32 lBufferSize, const struct TagItem *tags );

The first call, CopyMemDMA(), attempts to perform a "blocking" (does not return to the user until after the requested transfer has either succeeded or failed) operation. A value of TRUE is returned upon success and FALSE if an error occurred. The CopyMemDMA() call will automatically fall back to using an internal fast CPU copy if the requested size is too small to be efficient, or an error occurred in the DMA transaction.
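The fallback behaviour described above can be pictured with a small portable model. This is a sketch only, not the resource's implementation: the 256-byte threshold, the `dma_copy_fn` callback type and the `failing_dma_stub` stand-in are all assumptions introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed cut-off below which DMA is not worth it; the real value used
 * by the resource is not stated in this article. */
#define MIN_DMA_SIZE 256u

/* Hypothetical callback standing in for the hardware DMA path. */
typedef bool (*dma_copy_fn)(const void *src, void *dst, size_t size);

/* Hypothetical stand-in for a DMA path that always reports failure. */
bool failing_dma_stub(const void *src, void *dst, size_t size)
{
    (void)src;
    (void)dst;
    (void)size;
    return false;
}

/*
 * Try the DMA path for sufficiently large blocks; if the block is too
 * small or the DMA path reports an error, fall back to a plain CPU copy.
 */
bool copy_with_fallback(dma_copy_fn dma_copy,
                        const void *src, void *dst, size_t size)
{
    assert(src != NULL && dst != NULL);

    if (size >= MIN_DMA_SIZE && dma_copy != NULL && dma_copy(src, dst, size))
        return true;           /* hardware path succeeded */

    memcpy(dst, src, size);    /* CPU fallback */
    return true;
}
```

The point of the model is that the caller sees one blocking call and gets its data copied either way; which path was taken is an internal detail.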
In theory any contiguous block of memory from 1 Byte up to 4GB in size may be transferred using any one of these calls. In practice the DMA hardware can directly accept memory blocks up to FSLDMA_MAXBLOCK_SIZE (or 64MB - 1 Byte). Therefore any data blocks greater than FSLDMA_MAXBLOCK_SIZE will automatically be sent to the DMA hardware in a series of smaller chunks. For maximum efficiency transfer sizes should be at least 256 Bytes in size and be an even multiple of 64 Bytes. Odd sizes are handled by the hardware but will degrade performance.
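The sizing rules above can be checked with some simple arithmetic. The following portable sketch is an illustration under stated assumptions: `MAX_BLOCK_SIZE` mirrors the 64MB - 1 Byte limit from the text, while `CHUNK_SIZE` (the largest 64-Byte multiple below that limit) and both function names are hypothetical and not part of the fsldma.resource API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 64MB - 1 Byte, the largest block the hardware accepts directly. */
#define MAX_BLOCK_SIZE ((64u * 1024u * 1024u) - 1u)

/* Assumed chunk size for this sketch: the largest multiple of 64 Bytes
 * that still fits below MAX_BLOCK_SIZE. */
#define CHUNK_SIZE ((64u * 1024u * 1024u) - 64u)

static_assert(CHUNK_SIZE % 64u == 0u, "chunks must stay 64-Byte multiples");
static_assert(CHUNK_SIZE <= MAX_BLOCK_SIZE, "chunks must fit the hardware limit");

/* True when a transfer size follows the efficiency advice in the text:
 * at least 256 Bytes and a multiple of 64 Bytes. */
bool is_efficient_size(uint32_t size)
{
    return size >= 256u && (size % 64u) == 0u;
}

/* Number of hardware blocks needed for a transfer of the given size. */
uint32_t chunk_count(uint32_t size)
{
    if (size == 0u)
        return 0u;
    if (size <= MAX_BLOCK_SIZE)
        return 1u;             /* fits in a single hardware block */
    /* 64-bit arithmetic avoids overflow for sizes close to 4GB. */
    return (uint32_t)(((uint64_t)size + CHUNK_SIZE - 1u) / CHUNK_SIZE);
}
```

For example, a transfer of exactly 64MB is one Byte over the hardware limit and therefore needs two blocks in this model.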
Further reference - The fsldma.resource AutoDoc
See the [fsldma.resource AutoDoc] file for more details on each API call.
Example 1: "Blocking" mode
    #include <stdio.h>
    #include <stdlib.h>
    #include <amiga_compiler.h>
    #include <exec/types.h>
    #include <proto/exec.h>
    #include <dos/dos.h>
    #include <interfaces/exec.h>
    #include <interfaces/fsldma.h>

    // fsldma.resource Example 1, "Blocking" DMA Copy
    // test using Virtual memory areas
    int main(void)
    {
      struct fslDMAIFace *IfslDMA = NULL;
      uint32 lSize = 0;
      CONST_APTR pSrc = NULL;
      APTR pDest = NULL;
      BOOL bGotResources = FALSE;

      // Obtain the fsldma.resource
      IfslDMA = IExec->OpenResource(FSLDMA_NAME);

      // Set the size of our test buffers
      lSize = FSLDMA_OPTIMAL_BLKSIZE; // About 64 MB

      // Allocate a test source buffer (fill with some data)
      pSrc = IExec->AllocVecTags(lSize,
                                 AVT_Type, MEMF_SHARED,
                                 AVT_ClearWithValue, 0xB3,
                                 TAG_DONE);

      // Allocate a test destination buffer (clear with zeroes)
      pDest = IExec->AllocVecTags(lSize,
                                  AVT_Type, MEMF_SHARED,
                                  AVT_ClearWithValue, 0,
                                  TAG_DONE);

      // Verify we got all resources needed for the test
      if ( (NULL != IfslDMA) && (NULL != pSrc) && (NULL != pDest) )
      {
        bGotResources = TRUE;
      }

      if ( TRUE == bGotResources )
      {
        // Call IfslDMA->CopyMemDMA() to perform the memory copy using the
        // DMA hardware. Wait for the transaction to complete before continuing.

        // The pointers are cast to uint32 so they match the %lx conversions
        printf("Starting \"Blocking\" DMA Transaction"
               " of %lu bytes from 0x%08lx to 0x%08lx...\n",
               lSize,(uint32)pSrc,(uint32)pDest);
        fflush(stdout);

        if ( TRUE == IfslDMA->CopyMemDMA(pSrc,pDest,lSize) )
        {
          // Success - Do something with the copied data and quit
          printf("Returned with Success.\n");
          fflush(stdout);
        }
        else
        {
          // Report the error
          printf("Received an Error!\n");
          fflush(stdout);
        }
      }
      else
      {
        printf("Failed to obtain resources, aborting test.\n");
        fflush(stdout);
      }

      // Free our test buffers (as needed)
      if ( NULL != pSrc ) IExec->FreeVec((APTR)pSrc);
      if ( NULL != pDest ) IExec->FreeVec(pDest);

      return 0;
    }

Example 2: "Non-Blocking" mode
    #include <stdio.h>
    #include <stdlib.h>
    #include <amiga_compiler.h>
    #include <exec/types.h>
    #include <proto/exec.h>
    #include <dos/dos.h>
    #include <interfaces/exec.h>
    #include <interfaces/fsldma.h>

    // fsldma.resource Example 2, "Non-blocking" DMA Copy
    // test using user notifications and Physical memory areas
    int main(void)
    {
      struct ExecBase *ExecBase = (*(struct ExecBase **)4);
      struct MMUIFace *IMMU = NULL;
      struct fslDMAIFace *IfslDMA = NULL;
      uint32 lSize = 0;
      CONST_APTR pSrc = NULL;
      APTR pDest = NULL;
      CONST_APTR pPhySrcAttr = NULL;
      APTR pPhyDstAttr = NULL;
      int8 nSuccessSigNum = -1;
      int8 nInProgressSigNum = -1;
      int8 nErrorSigNum = -1;
      uint32 lSuccessSigMask = 0;
      uint32 lInProgressSigMask = 0;
      uint32 lErrorSigMask = 0;
      uint32 lAllSigsMask = 0;
      BOOL bGotResources = FALSE;

      // Obtain the MMU Interface
      IMMU = (struct MMUIFace *)IExec->GetInterface((struct Library *)ExecBase,
                                                    "MMU",1,NULL);

      // Obtain the fsldma.resource
      IfslDMA = IExec->OpenResource(FSLDMA_NAME);

      // Set our test buffer size large enough to generate some sub-block transfers
      lSize = FSLDMA_OPTIMAL_BLKSIZE * 4; // About 256 MB

      // Allocate a test source buffer (fill with some data)
      pSrc = IExec->AllocVecTags(lSize,
                                 AVT_Type, MEMF_SHARED,
                                 AVT_Contiguous, TRUE,
                                 AVT_Lock, TRUE,
                                 AVT_Alignment, 64,
                                 AVT_PhysicalAlignment, 64,
                                 AVT_ClearWithValue, 0xB3,
                                 TAG_DONE);

      // Allocate a test destination buffer (clear with zeroes)
      pDest = IExec->AllocVecTags(lSize,
                                  AVT_Type, MEMF_SHARED,
                                  AVT_Contiguous, TRUE,
                                  AVT_Lock, TRUE,
                                  AVT_Alignment, 64,
                                  AVT_PhysicalAlignment, 64,
                                  AVT_ClearWithValue, 0,
                                  TAG_DONE);

      // Allocate signals to wait on
      nSuccessSigNum = IExec->AllocSignal((int8)-1);
      nInProgressSigNum = IExec->AllocSignal((int8)-1);
      nErrorSigNum = IExec->AllocSignal((int8)-1);

      // Obtain the pointer to ourselves (this Task)
      struct Task *pThisTask = IExec->FindTask(NULL);

      // Verify we got all resources needed for the test
      if ( (NULL != IMMU) && (NULL != IfslDMA) &&
           (NULL != pSrc) && (NULL != pDest) &&
           (NULL != pThisTask ) &&
           (-1 != nSuccessSigNum) &&
           (-1 != nInProgressSigNum) &&
           (-1 != nErrorSigNum) )
      {
        bGotResources = TRUE;
      }

      if ( TRUE == bGotResources )
      {
        // Construct the signal masks we need to Wait() on later.
        // This is done only after all three signals have been verified,
        // because shifting by an unallocated signal number (-1) is undefined.
        lSuccessSigMask = (uint32)(1L << nSuccessSigNum);
        lInProgressSigMask = (uint32)(1L << nInProgressSigNum);
        lErrorSigMask = (uint32)(1L << nErrorSigNum);
        lAllSigsMask = (lSuccessSigMask | lInProgressSigMask | lErrorSigMask);

        // In this test we are using Physical memory areas
        // for both source and destination, so before we
        // hand the transaction to the DMA hardware, we first
        // need to flush the data cache, set the buffers to
        // cache-inhibited and obtain the Physical addresses
        // for both buffers

        // Enter Supervisor Mode
        APTR pUserStack = IExec->SuperState();

        // Flush out any cache for our two buffers
        IExec->CacheClearE((APTR)pSrc,lSize,CACRF_ClearD);
        IExec->CacheClearE(pDest,lSize,CACRF_ClearD);

        // Now set the memory attributes to prevent further cache operations
        uint32 lSrcMemAttrs = IMMU->GetMemoryAttrs((APTR)pSrc,0);
        uint32 lDstMemAttrs = IMMU->GetMemoryAttrs(pDest,0);
        IMMU->SetMemoryAttrs((APTR)pSrc,lSize,(lSrcMemAttrs | FSLDMA_PHYMEM_ATTRS));
        IMMU->SetMemoryAttrs(pDest,lSize,(lDstMemAttrs | FSLDMA_PHYMEM_ATTRS));

        // Get the Physical addresses for our two buffers
        pPhySrcAttr = IMMU->GetPhysicalAddress((APTR)pSrc);
        pPhyDstAttr = IMMU->GetPhysicalAddress(pDest);

        // Return to User Mode
        if ( NULL != pUserStack ) IExec->UserState(pUserStack);

        // Call IfslDMA->CopyMemDMATags() to perform the memory copy using the
        // DMA hardware. Start off the transaction then wait on completion signals.

        // The pointers are cast to uint32 so they match the %lx conversions
        printf("Starting \"Non-Blocking\" DMA Transaction"
               " of %lu bytes from 0x%08lx to 0x%08lx...\n",
               lSize,(uint32)pPhySrcAttr,(uint32)pPhyDstAttr);
        fflush(stdout);

        if ( TRUE == IfslDMA->CopyMemDMATags(pPhySrcAttr,pPhyDstAttr,lSize,
                       FSLDMA_CM_SourceIsPhysical, TRUE,
                       FSLDMA_CM_DestinationIsPhysical, TRUE,
                       FSLDMA_CM_DoNotWait, TRUE,
                       FSLDMA_CM_NotifyTask, pThisTask,
                       FSLDMA_CM_NotifySignalNumber, nSuccessSigNum,
                       FSLDMA_CM_NotifyProgessSignalNumber, nInProgressSigNum,
                       FSLDMA_CM_NotifyErrorSignalNumber, nErrorSigNum,
                       TAG_DONE) )
        {
          // If the above call sets up correctly, it returns immediately
          // So do something here *while* the data is being copied...
          printf("Doing other stuff before waiting...\n");
          fflush(stdout);

          // Now we will Wait() for the completion signals from the DMA
          BOOL bStillRunning = TRUE;
          uint32 lSigReceived = 0;
          do
          {
            // Wait for the DMA completion/status/error signals
            printf("Waiting for signals...\n");
            fflush(stdout);
            lSigReceived = IExec->Wait( lAllSigsMask | SIGBREAKF_CTRL_C );

            // Test for the "In Progress" signal
            if ( lInProgressSigMask == (lSigReceived & lInProgressSigMask) )
            {
              // Do something on the "In Progress" signal and continue...
              printf("Received an \"In Progress\" signal...\n");
              fflush(stdout);
            }

            // Test for the "Success" signal
            if ( lSuccessSigMask == (lSigReceived & lSuccessSigMask) )
            {
              // Success - Do something with the copied data and quit
              printf("Received a \"Completed Successfully\" signal.\n");
              fflush(stdout);
              bStillRunning = FALSE;
            }

            // Test for the "Error" signal
            if ( lErrorSigMask == (lSigReceived & lErrorSigMask) )
            {
              // Report the error and quit
              printf("Received an \"Error\" signal!\n");
              fflush(stdout);
              bStillRunning = FALSE;
            }

            // Finally, check if we got a Break signal from the user
            // just in case we want to quit out early for some reason
            if ( SIGBREAKF_CTRL_C == (lSigReceived & SIGBREAKF_CTRL_C) )
            {
              printf("Received a \"User break\" signal, ending.\n");
              fflush(stdout);
              bStillRunning = FALSE;
            }
          } while( TRUE == bStillRunning );
        }
      }
      else
      {
        printf("Failed to obtain resources, aborting test.\n");
        fflush(stdout);
      }

      // Free our test buffers (as needed)
      if ( NULL != pSrc ) IExec->FreeVec((APTR)pSrc);
      if ( NULL != pDest ) IExec->FreeVec(pDest);

      // Free any signals we (may have) obtained earlier
      if ( -1 != nSuccessSigNum ) IExec->FreeSignal(nSuccessSigNum);
      if ( -1 != nInProgressSigNum ) IExec->FreeSignal(nInProgressSigNum);
      if ( -1 != nErrorSigNum ) IExec->FreeSignal(nErrorSigNum);

      // Release the MMU Interface (as needed)
      if ( NULL != IMMU ) IExec->DropInterface((struct Interface *)IMMU);

      return 0;
    }

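Because a signal allocation can fail and return -1, the signal masks in an example like the one above are only meaningful once every signal number has been verified. The portable sketch below shows one defensive way to build such masks. The helper names `signal_mask` and `combined_mask` are hypothetical and not part of any AmigaOS API; the sketch assumes 32 signal bits per task.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Convert a signal number into its bit mask.
 * Returns 0 for an unallocated (-1) or out-of-range number instead of
 * performing an undefined shift.
 */
uint32_t signal_mask(int8_t sig_num)
{
    if (sig_num < 0 || sig_num > 31)
        return 0u;
    return (uint32_t)1u << sig_num;
}

/*
 * Combine several signal numbers into one mask suitable for a wait call.
 * Returns 0 if any of the numbers is invalid, so the caller can treat a
 * zero result as "do not wait".
 */
uint32_t combined_mask(const int8_t *sig_nums, size_t count)
{
    uint32_t all = 0u;

    assert(sig_nums != NULL);

    for (size_t i = 0; i < count; i++) {
        uint32_t mask = signal_mask(sig_nums[i]);
        if (mask == 0u)
            return 0u;     /* at least one signal was never allocated */
        all |= mask;
    }
    return all;
}
```

A caller would check the combined mask for zero before starting a non-blocking transaction, rather than discovering a missing signal only after the transfer is already in flight.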
Obtaining the fsldma.resource
Breaking the above example down, the first thing we do is include the interface header for the fsldma.resource and obtain the resource itself.

    #include <interfaces/fsldma.h>

    struct fslDMAIFace *IfslDMA = IExec->OpenResource(FSLDMA_NAME);

Performance
Testing DMA memory copies against their CPU-based equivalents indicates that the DMA hardware on all three models (X5000/20, X5000/40 and A1222) moves data approximately two to three times faster than the CPU alone.
It should also be noted that these tests were performed while the CPU was effectively idle. CPU memory copy performance drops roughly in proportion to how busy the CPU is: the more the CPU is doing, the longer a copy takes to complete.
Conversely, DMA memory copy times are far more predictable because each copy uses the same (minimal) amount of CPU overhead. The time the DMA hardware needs to complete a transaction can be bounded between two cases: all other DMA Channels idle, and all DMA Channels busy at once with data moves arbitrating between channels every 1024 bytes.
Optimal use of the DMA Engine
To gain the best performance from the DMA Engines, block sizes of 256 bytes or more should be used. Also, if possible the size should be an even multiple of at least 4 bytes.
Additionally the memory blocks (source and destination) should be aligned to start on a 64 Byte boundary.
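These recommendations are easy to check in code before handing a buffer to the DMA copy functions. The following portable helpers are a minimal sketch; the names `is_dma_aligned` and `round_up_to_alignment` are made up for this example and are not part of the fsldma.resource API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_ALIGNMENT 64u  /* recommended start boundary, per the text */

/* True when the buffer starts on a 64 Byte boundary. */
bool is_dma_aligned(const void *ptr)
{
    return ((uintptr_t)ptr % DMA_ALIGNMENT) == 0u;
}

/* Round a requested size up to the next multiple of 64 Bytes, so that a
 * buffer can be allocated (and copied) in whole 64-Byte units. */
size_t round_up_to_alignment(size_t size)
{
    assert(size <= SIZE_MAX - (DMA_ALIGNMENT - 1u));  /* no overflow */
    return (size + DMA_ALIGNMENT - 1u) / DMA_ALIGNMENT * DMA_ALIGNMENT;
}
```

On AmigaOS the alignment itself would normally be requested at allocation time, as Example 2 does with AVT_Alignment and AVT_PhysicalAlignment; helpers like these are only useful for validating buffers that come from elsewhere.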
The DMA Engine can and does handle misaligned and odd sized data blocks by first shifting the minimum required bytes to correct the alignment, then taking smaller chunks of data until the largest chunk of the block can be moved. It can also handle copies of as little as one byte (don't do this).
However, this does degrade performance.
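To see why misaligned or odd-sized blocks cost extra, it helps to split a transfer into an unaligned head, an aligned body and a leftover tail. The sketch below is a simplified model of that idea only; it does not claim to reproduce the steps the DMA hardware actually takes, and the `dma_split` type and `split_for_alignment` function are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_ALIGNMENT 64u

/* Byte counts for the three parts of a modelled transfer. */
typedef struct {
    size_t head;  /* bytes needed to reach the next 64 Byte boundary */
    size_t body;  /* aligned middle part, a multiple of 64 Bytes */
    size_t tail;  /* leftover bytes after the last full 64-Byte unit */
} dma_split;

/* Split a transfer starting at 'address' into head, body and tail. */
dma_split split_for_alignment(uintptr_t address, size_t size)
{
    dma_split s = {0, 0, 0};
    size_t misalign = (size_t)(address % DMA_ALIGNMENT);
    size_t head = (misalign != 0u) ? DMA_ALIGNMENT - misalign : 0u;

    if (head > size)
        head = size;       /* block ends before the boundary is reached */

    s.head = head;
    s.tail = (size - head) % DMA_ALIGNMENT;
    s.body = size - head - s.tail;

    assert(s.head + s.body + s.tail == size);
    return s;
}
```

An aligned 256-Byte block is all body, whereas a 300-Byte block starting 4 Bytes past a boundary splits into 60 + 192 + 48 Bytes, so only about two thirds of it moves in full-sized units.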
In general, the larger the data block the better.
Current fsldma.resource API release notes
fsldma.resource 53.1 (30.9.2019) <jkrueger>

- First release.


fsldma.resource 53.2 (4.11.2019) <jkrueger>

- Replaced the Busy Wait polling of the DMA Channel
  completion status with a fully event (interrupt)
  driven, multitasking task handler system.


fsldma.resource 53.3 (22.11.2019) <jkrueger>

- Reworked DMACopyMem() to properly handle normal
  cache-enabled virtual memory blocks using a
  combination of StartDMA()/EndDMA() and "CPU Snoop"
  mode on the DMA hardware.


fsldma.resource 53.4 (4.12.2019) <jkrueger>

- Major API rework and streamlining.
  Removed all other API calls save DMACopyMem().
  DMACopyMem() was renamed CopyMemDMA() and new
  TagItem based versions were added: CopyMemDMATagList()
  and CopyMemDMATags().

  Seven new Tags were added to the API for use
  with the Tags version of CopyMemDMA. They enable
  using any combination of Virtual and Physical
  memory copies and support for new "Non-Blocking"
  transactions with user Notification via three
  possible signals: "Success", "In Progress" and "Error."

  Several internal changes and improvements including,
  but not limited to, streamlining internal signal
  handling between DMA Handler Tasks and user tasks
  and optimizing transactions using "Basic Chaining Mode."

  The DMA hardware's "Basic Chaining Mode" is now used
  internally to transfer larger single block sizes.
  Blocks greater than FSLDMA_OPTIMAL_BLKSIZE in size
  are transferred using "Basic Chaining Mode", while
  blocks less than or equal to FSLDMA_OPTIMAL_BLKSIZE
  are transferred using "Basic Direct Mode."

  Previously, block sizes larger than FSLDMA_OPTIMAL_BLKSIZE
  were all transferred using a CPU driven loop of
  "Basic Direct Mode" transactions. This internal CPU
  driven loop has now been replaced by programming the
  hardware to perform a series of block transactions
  in a "Chain" as a single hardware transaction.
  This also enabled bringing out "Completed Successfully",
  "In Progress" (a sub-block has transferred successfully)
  and "Error" signaling during "Non-Blocking" transactions.

  Note:
  Currently only source and destination memory
  areas of the same type, Virtual-to-Virtual or
  Physical-to-Physical, are supported.

  Also, only Physical-to-Physical transactions
  support "Non-Blocking" transactions and User
  Notification signaling.

  Future releases will remove these limitations.