Tuesday, November 25, 2014

AXI interfacing with Chisel

I recently started using Chisel, a hardware construction language from UC Berkeley implemented as a Scala DSL. Although it still has some rough edges, it's definitely usable and I really like how it can map the same description into Verilog or a cycle-accurate C++ simulation.

I'm currently working on making hardware accelerators for irregular applications (like graph traversal and sparse matrix operations - oh wait, we can do one with the other). Although simulation goes a long way for testing and evaluation, I prefer FPGA implementations (even though it takes so much more time and effort, the end product actually is a computer). I found that Chisel also helps out quite a bit with productivity there. Along the way, I made a little something that someone else might find useful: a collection of AXI4 interface definitions and simple peripherals in Chisel. Here's the github repo:

https://github.com/maltanar/axi-in-chisel

I started working on this because I wanted more hands-on experience with Chisel and a faster way of making my hardware prototypes work with the AXI interfaces found on e.g. Xilinx FPGAs and programmable SoCs. There are 50+ signals/ports on a full-blown AXI interface, which can be daunting. However, these can be organized into address and data channels based on the decoupled (ready/valid) abstraction, which fits nicely with Chisel's Decoupled-style interfaces and custom types.
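As a rough illustration of the ready/valid idea, here is a small model written in plain C++ (to match the other listings on this blog) rather than in Chisel. The names `Decoupled`, `ReadAddr` and `countTransfers` are made up for this sketch and are not taken from the repository. A channel is modelled as a payload plus two handshake flags, and a transfer happens only in cycles where both flags are high:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative model of one decoupled (ready/valid) channel.
// The producer drives valid and bits; the consumer drives ready.
template <typename T>
struct Decoupled {
    bool valid = false;  // producer has data to offer
    bool ready = false;  // consumer can accept data
    T bits{};            // payload, meaningful only while valid is set

    // A transfer happens in a cycle where both sides agree.
    bool fire() const { return valid && ready; }
};

// Hypothetical payload for a read address channel
// (a small subset of the signals on a real AXI channel).
struct ReadAddr {
    uint32_t addr;
    uint8_t len;  // burst length minus one, as AXI4 encodes it
};

// Count how many transfers happen over a sequence of cycles,
// given the per-cycle valid and ready values.
inline int countTransfers(const bool* valid, const bool* ready, int cycles) {
    assert(cycles >= 0);
    Decoupled<ReadAddr> ch;
    int transfers = 0;
    for (int i = 0; i < cycles; ++i) {
        ch.valid = valid[i];
        ch.ready = ready[i];
        if (ch.fire()) ++transfers;
    }
    return transfers;
}
```

Grouping the signals this way means each of the AXI channels becomes one instance of the same small abstraction, instead of a dozen loose wires.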

It is worth mentioning that the code here targets the Verilog backend and hardware synthesis only, since there are no testbenches. The generated Verilog should be straightforward to use with the Xilinx IP packager. The peripherals aren't extensively tested, but they performed as expected on a ZedBoard (pushed through Vivado for synthesis).

Right now the repository is a haphazard collection of Chisel source files, with varying degrees of commenting in each:

SimpleReg - a translation of the AXI Lite slave template (register file) generated by Vivado

SumAccel - read 3 consecutive words from address 0x10000000 using AXI Lite master and sum them (the result can be read through the AXI Lite slave interface)

HPSumAccel - read a specified number of words from a specified address in large bursts and sum them. Note that the number of words should be a multiple of the burst length (set to 512 32-bit words by default), since this doesn't handle chopping the transfer into full bursts plus a remainder. The result and total elapsed cycles can be read through the slave interface.
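To make the burst-length restriction concrete, here is a small sketch of the arithmetic in C++. The function name and the choice to signal the unsupported case with `-1` are assumptions for illustration only, not something taken from HPSumAccel itself:

```cpp
#include <cassert>
#include <cstdint>

// Default burst length mentioned above: 512 words of 32 bits each.
const uint32_t kBurstWords = 512;

// Returns the number of full bursts needed to read wordCount words,
// or -1 if wordCount is not a multiple of the burst length
// (the case the accelerator does not handle).
inline int64_t burstsNeeded(uint32_t wordCount, uint32_t burstWords = kBurstWords) {
    assert(burstWords > 0);
    if (wordCount % burstWords != 0) return -1;
    return wordCount / burstWords;
}
```

So a request for 1024 words maps to two bursts, while a request for 1000 words would have to be padded or split by the software driving the accelerator.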

I realize I could have used DMA IP cores to do most of this, but keeping the interfacing in Chisel has some benefits like tighter integration with the peripherals. Plus it's been a good learning experience :)

Saturday, May 3, 2014

A Programming Utility for the Avnet Spartan-6 Evaluation Kit

While playing around with FPGA-based sparse matrix-vector accelerators on the Avnet Spartan-6 LX16 Evaluation Kit, I realized I needed a utility for uploading binary files to the board for testing. The board supports programming over a USB serial connection, but Avnet provides only a Windows utility for this purpose, and it doesn't have the option to select and send binary files. I was also getting rather sick of having to switch to a Windows virtual machine just to run the programming utility, not to mention this beautiful user interface with jelly buttons and everything:


So I decided to make a small utility that would run on Linux, with simple text-based rx/tx as well as support for FPGA bitfile configuration and sending binary files. The "hardest" part was deciphering the text-based protocol for talking to the on-board PSoC, which is responsible for manipulating the necessary FPGA pins and pushing in the new configuration file. Since the PSoC firmware documentation was nowhere to be found on Avnet's download pages, I ended up using a serial port sniffer to see what the protocol looked like, which was something like this:

(lines prefixed with >> are TX from the progutil to the board)
>> get_config
ack
>> get_ver
3.0.2 ack
>> load_config 1
ack
>> drive_prog 0
ack
>> drive_mode 7
ack
>> spi_mode 1
ack
>> drive_prog 1
ack
>> read_init
ack ack
>> drive_mode 8
ack
>> fpga_rst 1
ack
>> ss_program
ack
>> 
ack
>> read_init
ack ack
>> read_done
ack ack
>> spi_mode 0
ack
>> load_config 0
ack
>> fpga_rst 0
ack
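
Judging from the capture, each command is a line of text and the board ends its reply with `ack`, sometimes preceded by a value such as the version string. A minimal sketch of parsing such a reply in C++ follows. The function names are hypothetical and this is not the code used in the utility, which relies on Qt for the serial port handling:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Split a reply into whitespace-separated tokens.
inline std::vector<std::string> tokenize(const std::string& reply) {
    std::istringstream in(reply);
    std::vector<std::string> tokens;
    std::string tok;
    while (in >> tok) tokens.push_back(tok);
    return tokens;
}

// A reply counts as acknowledged if its last token is "ack".
inline bool isAcked(const std::string& reply) {
    std::vector<std::string> tokens = tokenize(reply);
    return !tokens.empty() && tokens.back() == "ack";
}

// Return the first token of an acknowledged reply that has more than
// one token (e.g. "3.0.2" for get_ver), or an empty string otherwise.
// For "ack ack" this returns "ack"; the capture alone does not say
// what that first token means.
inline std::string replyValue(const std::string& reply) {
    std::vector<std::string> tokens = tokenize(reply);
    if (tokens.size() < 2 || tokens.back() != "ack") return "";
    return tokens.front();
}

// Quick check against replies seen in the capture above.
inline void checkCapturedReplies() {
    assert(isAcked("ack"));
    assert(isAcked("3.0.2 ack"));
    assert(replyValue("3.0.2 ack") == "3.0.2");
    assert(!isAcked(""));
}
```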

It took a little bit of trial-and-error and some reading of documentation on QThreads (for data rx-tx without blocking the user interface), and the result ended up looking like the following:



I cannot claim huge improvements on the user interface front, but at least it works on Linux and supports the minimal feature set needed to be useful.  I'm planning to add support for power measurements, and the code certainly could use lots of cleanup, but we'll see how much time I have for that. If you are interested, the sources are available at:

https://github.com/maltanar/s6lxek-progutil

I eventually managed to Google my way into finding the firmware documentation, which I've also put on github, in case someone has need of it at some point.

Thursday, April 4, 2013

The first SHMACsim progress presentation

After starting my thesis roughly 2 months ago, I've now reached a point that I consider an internal milestone: a working behavioral simulation of the tile-based heterogeneous multicore architecture I've been working with.

Tomorrow (5th April) I'll be giving a status presentation to the research group I work with here at NTNU, and I've prepared a Prezi presentation for the occasion. It gives some brief background information on what the SHMAC is and why it's needed, why a simulator for the SHMAC is needed, and proceeds with some details about why SystemC was chosen for the simulation infrastructure and ArchC for the core generation. Contrary to the academic presentation trend, there's (unfortunately) a general lack of text in the presentation itself, but it should be enough to get some ideas across :)

Wednesday, March 20, 2013

Extending ArchC cores with TLM memory

Continuing from where I left off in the last post, I want to take a look at TLM, the ArchC TLM port, and the possibilities it brings.

About TLM

I should probably start by saying I didn't really get much out of the first things I read about TLM (either I didn't read them thoroughly enough, or the concept is too abstract to understand without seeing some examples). TLM stands for Transaction Level Modelling, and Wikipedia defines it as "a high-level approach to modeling digital systems where details of communication among modules are separated from the details of the implementation of functional units or of the communication architecture". The definition is rather vague, which is part of what confused me, but this looseness is actually what makes it powerful and interesting.

Let's take an example to get a more solid grip on it: main memory connected to a processor, no caches, no complicated memory hierarchies. A Von Neumann type processor has a rather simple expectation from the main memory it's connected to: it wants to be able to write to the memory at a given address, and similarly read from the memory at a given address. If you wanted to create a model for a memory that satisfies these requirements, you could do it in many ways on many levels. You could, for example, write a piece of C code that fills in/reads out elements of an array with read/write function calls, or a slightly more complicated version with some kind of wait() function to emulate the delays occurring while accessing the memory. Or you could write a VHDL/Verilog description at the RTL (register transfer level). Or go down a level to build this memory transistor by transistor. Alternatively, the processor and the memory may be communicating through some kind of bus instead of having a direct connection, and this connection itself can be subject to different levels of modelling.

All of these levels of modelling have their advantages/disadvantages, each serving some specific purpose. There is, however, one thing that does not change: the interface to the memory is essentially the same in all these levels of modelling, whereas the gory (or not so gory) details of the actual implementation change according to the model. This is essentially what TLM dictates: define the interface, i.e. what kind of data is passed back and forth. It's quite like the object oriented programming concept of interfaces in this respect.
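
In C++ terms, the idea can be sketched as an abstract class with two interchangeable implementations. The class names below are invented for the example; the point is only that the code using the memory does not change when the model behind the interface does:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// The interface: all a simple processor model needs from main memory.
class MemoryIf {
public:
    virtual ~MemoryIf() {}
    virtual uint32_t read(uint32_t addr) = 0;
    virtual void write(uint32_t addr, uint32_t data) = 0;
};

// Purely functional model: a word-addressed array, no notion of time.
class ArrayMemory : public MemoryIf {
public:
    explicit ArrayMemory(std::size_t words) : store(words, 0) {}
    uint32_t read(uint32_t addr) override {
        assert(addr < store.size());
        return store[addr];
    }
    void write(uint32_t addr, uint32_t data) override {
        assert(addr < store.size());
        store[addr] = data;
    }
private:
    std::vector<uint32_t> store;
};

// Same interface, but every access also costs a fixed number of cycles.
class TimedMemory : public MemoryIf {
public:
    TimedMemory(std::size_t words, uint64_t latency)
        : mem(words), latencyCycles(latency), elapsed(0) {}
    uint32_t read(uint32_t addr) override {
        elapsed += latencyCycles;
        return mem.read(addr);
    }
    void write(uint32_t addr, uint32_t data) override {
        elapsed += latencyCycles;
        mem.write(addr, data);
    }
    uint64_t elapsedCycles() const { return elapsed; }
private:
    ArrayMemory mem;
    uint64_t latencyCycles;
    uint64_t elapsed;
};

// Code written against the interface works with either model.
inline uint32_t sumWords(MemoryIf& m, uint32_t start, uint32_t count) {
    uint32_t sum = 0;
    for (uint32_t i = 0; i < count; ++i) sum += m.read(start + i);
    return sum;
}
```

`sumWords` plays the role of the processor here: it can be handed an `ArrayMemory` for quick functional runs or a `TimedMemory` when cycle counts matter, without being touched.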

There's of course more to TLM than this: the OSCI (Open SystemC Initiative) has standardized the approach and offers a set of well-defined building blocks for constructing TLM interfaces. It may sound scary, but TLM 1.0 in particular is quite simple to understand: it defines interfaces for either uni- or bi-directional communication for any given data type (via C++ templates), in either blocking or non-blocking form.
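
The shape of these interfaces can be sketched without the SystemC headers. The classes below are stand-ins written for this post and not the actual OSCI declarations; they only mirror the idea of a bidirectional blocking transport and a unidirectional non-blocking put, each templated on the data types being exchanged:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Stand-in for a bidirectional, blocking interface:
// hand over a request, get a response back from the same call.
template <typename REQ, typename RSP>
class TransportIf {
public:
    virtual ~TransportIf() {}
    virtual RSP transport(const REQ& req) = 0;
};

// Stand-in for a unidirectional, non-blocking interface:
// the call returns immediately and reports whether the item was accepted.
template <typename T>
class NonblockingPutIf {
public:
    virtual ~NonblockingPutIf() {}
    virtual bool nb_put(const T& item) = 0;
};

// Example target for the blocking interface: doubles whatever it receives.
class Doubler : public TransportIf<int, int> {
public:
    int transport(const int& req) override { return 2 * req; }
};

// Example target for the non-blocking interface: a queue with limited capacity.
class BoundedQueue : public NonblockingPutIf<int> {
public:
    explicit BoundedQueue(std::size_t capacity) : cap(capacity) {
        assert(capacity > 0);
    }
    bool nb_put(const int& item) override {
        if (items.size() >= cap) return false;  // full: refuse instead of waiting
        items.push_back(item);
        return true;
    }
    std::size_t size() const { return items.size(); }
private:
    std::size_t cap;
    std::deque<int> items;
};
```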

If that muddled things down instead of clearing them up, the next bit should help more: a real TLM example, from ArchC!

The ArchC TLM port

ArchC normally offers a pretty complete package for playing around with processors, but there's a great deal of flexibility possible when you want to start customizing parts of that package, and this is where TLM comes into play. Starting from ArchC 2.0 (or something close to that :)) a TLM interface is offered for connecting "memory" to ArchC-generated cores. This is how:


AC_ARCH(mips1)
{
  ac_tlm_port DM:16M; // declare a TLM port that can address 16M
  ac_regbank RB:32;
  ac_reg npc;
  ac_reg hi, lo;
  ac_tlm_intr_port inta;

  ac_wordsize 32;

  ARCH_CTOR(mips1) {
    ac_isa("mips1_isa.ac");
    set_endian("big");
  };
};

Easy! So once we have the TLM port, what do we do with it? What does the ArchC TLM interface look like? On the "internal" side (= the side which the processor core itself talks to) the functions inherited from ac_inout_if are used, just like regular memory. But it's the "external" side that concerns us: what does the TLM port do to "talk" to the outside world? It's derived from sc_port, which in turn is a port for the following interface:

typedef tlm_transport_if<ac_tlm_req, ac_tlm_rsp> ac_tlm_transport_if;

Now that's a SystemC TLM interface: the template uses ac_tlm_req for sending out requests to the external world and ac_tlm_rsp for getting responses from the external world. Those types are:


/// ArchC TLM request packet.
struct ac_tlm_req {
  ac_tlm_req_type type; // READ, WRITE, LOCK, UNLOCK...
  int dev_id;
  uint32_t addr;
  uint32_t data;
};

/// ArchC TLM response packet.
struct ac_tlm_rsp {
  ac_tlm_rsp_status status; // ERROR, SUCCESS
  ac_tlm_req_type req_type; // same as in request
  uint32_t data;
};

And that's it! That's what ArchC uses when it wants to do something with a piece of external memory. The TLM base interface itself (tlm_transport_if) needs a function called transport that carries the request and returns the response:

ac_tlm_rsp transport(const ac_tlm_req & req);

Having seen these, it shouldn't be difficult to model a simple memory that can be connected to the ArchC TLM port:


class ArchCTLMMemory : public ac_tlm_transport_if {
public:
        // allocate the backing store as raw bytes
        ArchCTLMMemory(uint32_t size_bytes) { memory = new uint8_t[size_bytes]; }
        ~ArchCTLMMemory() { delete[] memory; }

        ac_tlm_rsp transport(const ac_tlm_req & req)
        {
                ac_tlm_rsp response;
                response.status = SUCCESS;
                response.req_type = req.type;
                // treat the bytes at req.addr as a 32-bit word
                // (no bounds or alignment checks in this sketch)
                if(req.type == READ)
                        response.data = *((uint32_t *) &memory[req.addr]);
                else if(req.type == WRITE)
                        *((uint32_t *) &memory[req.addr]) = req.data;

                return response;
        }

protected:
        uint8_t *memory;
};

While it's by no means a complete example, it should be enough to illustrate the simplicity of making an ArchC-interfaceable memory element. And the real power is of course that the TLM interface dictates nothing about the memory implementation - you're free to include delays, assertions, stat counters, or to route this memory request to some other component (which is the case for the tile-based system I'm building).
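
To give a flavour of that freedom, here is a sketch of a memory that does bounds checking and keeps access counters. So that it compiles on its own, it declares simplified stand-ins for the ArchC request and response types inside a `sketch` namespace; those stand-ins, the class name and the counters are assumptions for illustration, not ArchC code:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

namespace sketch {

// Simplified stand-ins for the ArchC TLM packet types shown above.
enum ReqType { READ, WRITE };
enum RspStatus { ERROR, SUCCESS };
struct Req { ReqType type; uint32_t addr; uint32_t data; };
struct Rsp { RspStatus status; ReqType req_type; uint32_t data; };

// A byte-addressed memory that validates addresses and counts accesses.
class CountingMemory {
public:
    explicit CountingMemory(uint32_t size_bytes)
        : bytes(size_bytes, 0), reads(0), writes(0), errors(0) {
        assert(size_bytes >= sizeof(uint32_t));  // room for at least one word
    }

    Rsp transport(const Req& req) {
        Rsp rsp;
        rsp.status = SUCCESS;
        rsp.req_type = req.type;
        rsp.data = 0;
        // Reject accesses that would run past the end of the memory.
        if (req.addr > bytes.size() || bytes.size() - req.addr < sizeof(uint32_t)) {
            rsp.status = ERROR;
            ++errors;
            return rsp;
        }
        // memcpy keeps the word in host byte order and avoids unaligned casts.
        if (req.type == READ) {
            std::memcpy(&rsp.data, &bytes[req.addr], sizeof(uint32_t));
            ++reads;
        } else {
            std::memcpy(&bytes[req.addr], &req.data, sizeof(uint32_t));
            ++writes;
        }
        return rsp;
    }

    uint64_t readCount() const { return reads; }
    uint64_t writeCount() const { return writes; }
    uint64_t errorCount() const { return errors; }

private:
    std::vector<uint8_t> bytes;
    uint64_t reads, writes, errors;
};

}  // namespace sketch
```

The caller still only sees a request going in and a response coming out, so swapping this in for the plain memory needs no changes on the processor side.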


Monday, March 11, 2013

Playing around with ArchC: TLM and multicores

As the final goal of my MSc thesis is to create a simulation framework for the SHMAC (Single ISA Heterogeneous Multicore Architecture Computer) I've been spending some time on creating ArchC/SystemC simulations for multicores and interconnects, and it feels like I'm finally getting to a point where I have a clearer picture of the concepts involved. Or so I hope :)

A bit of a background on SHMAC: it's a tile-based heterogeneous architecture which was first realized on an FPGA last year in the form of this MSc thesis by Leif Tore Rusten and Gunnar Inge Sortland at NTNU. The simulator I'm developing will (hopefully) eventually be used to evaluate interconnect and cache matters for the architecture.

It's mostly the "Processor Design with ArchC" chapter of the "Processor Description Languages" book that motivated me to write this - it gives a little warm-up on the features of ArchC including the TLM memory and interrupt controller port, and then goes on to how these concepts could be used to construct a two-core system.

// ...includes, includes...
int sc_main(int ac, char *av[])
{
    mips1 proc1("p1");
    mips1 proc2("p2");

    someBus bus("b");
    someMemory mem("m");

    proc1.memoryPort(bus);
    proc2.memoryPort(bus);
    bus.memoryPort(mem);

    // ...even more connections, omitted since they're related to the interrupt controller
    proc1.initAndLoad(someBinary);
    proc2.initAndLoad(someBinary);

    sc_start();

    // ..print start and exit
}


Despite how promising it looks, there is unfortunately no source code accompanying the book (at least none that I'm aware of). The code example in the book describes rather well how the top-level components are connected, but how the memory, bus and interrupt controller are actually implemented is not mentioned. I was also unable to find any further code examples using ArchC for multi-core simulation. So I decided to take the plunge into ArchC's documentation and source code to understand how I could make it work, and well, I guess I can say the results are better than expected :)
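
For readers wondering what such a bus might look like, here is one possible minimal shape, written in plain C++ with invented names (`Target`, `Ram`, `SimpleBus`) and without SystemC or ArchC, so it says nothing about how the book's authors did it. It just forwards each access to whichever target has been mapped at the requested address:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Anything that can sit behind the bus.
class Target {
public:
    virtual ~Target() {}
    virtual uint32_t read(uint32_t offset) = 0;
    virtual void write(uint32_t offset, uint32_t data) = 0;
};

// Word-addressed RAM used as an example target.
class Ram : public Target {
public:
    explicit Ram(uint32_t words) : store(words, 0) {}
    uint32_t read(uint32_t offset) override {
        assert(offset < store.size());
        return store[offset];
    }
    void write(uint32_t offset, uint32_t data) override {
        assert(offset < store.size());
        store[offset] = data;
    }
private:
    std::vector<uint32_t> store;
};

// Routes each access to the target whose address range contains it.
class SimpleBus {
public:
    // Map target t at [base, base + size). Ranges are assumed not to overlap.
    void map(uint32_t base, uint32_t size, Target* t) {
        Region r = {base, size, t};
        regions.push_back(r);
    }
    // Both accessors return false if no target is mapped at addr.
    bool read(uint32_t addr, uint32_t& data) {
        Region* r = find(addr);
        if (!r) return false;
        data = r->target->read(addr - r->base);
        return true;
    }
    bool write(uint32_t addr, uint32_t data) {
        Region* r = find(addr);
        if (!r) return false;
        r->target->write(addr - r->base, data);
        return true;
    }
private:
    struct Region { uint32_t base; uint32_t size; Target* target; };
    Region* find(uint32_t addr) {
        for (std::size_t i = 0; i < regions.size(); ++i)
            if (addr >= regions[i].base && addr - regions[i].base < regions[i].size)
                return &regions[i];
        return 0;
    }
    std::vector<Region> regions;
};
```

A real shared bus would also need arbitration between the two cores and some notion of time, both of which are left out here.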

More implementation details and juicy TLM stuff (not really, it's actually quite simple) coming up in the next post!

Tuesday, February 26, 2013

Architecture simulators, SystemC and ArchC

I've been looking into existing literature and simulators for the heterogeneous simulator I'll be developing, and after discussing some of the options with my supervisor Magnus the final conclusion was to use SystemC together with ArchC for the simulation infrastructure. To give a brief overview of what they are:

  • SystemC is a set of C++ libraries and a simulation kernel, with plenty of useful functionality for creating models of complex hardware systems at different levels. Combining the object oriented paradigm with extra modelling capabilities for concurrency, timing and communication results in a flexible and powerful tool, and you get to decide the level of detail you would like for your models - anything from RTL (a hardware-synthesizable subset exists!) to expressing a whole processor instruction execution cycle with a switch statement.
  • ArchC is an open-source architecture description language that was built to allow researchers or companies to quickly prototype new computer architecture ideas. It can create SystemC simulations of the proposed architecture or create compiled versions for more speed, and it can even generate a GNU binutils suite targeting the architecture you specified!

So what's the first thing you'd like to see when someone starts talking about some new/esoteric language? You'd probably want to see examples. Here's a SystemC example from Wikipedia:

#include "systemc.h"
 
SC_MODULE(adder)          // module (class) declaration
{
  sc_in<int> a, b;        // ports
  sc_out<int> sum;
 
  void do_add()           // process
  {
    sum.write(a.read() + b.read()); //or just sum = a + b
  }
 
  SC_CTOR(adder)          // constructor
  {
    SC_METHOD(do_add);    // register do_add to kernel
    sensitive << a << b;  // sensitivity list of do_add
  }
};

ArchC syntax looks pretty similar (it's inspired by SystemC in any case) with specific language constructs to specify the instruction set architecture (ISA) and the microarchitecture. They have ArchC models for a number of different cores at the ArchC website, check it out!

So the microarchitecture description (well, it's not a complete microarchitecture description, but you can always customize the connections and the components if you want to) looks like this in ArchC:

AC_ARCH(mips1){

  ac_mem   DM:5M; // 5 megs of direct access memory
  ac_regbank RB:32; // register bank
  ac_reg npc;
  ac_reg hi, lo;

  ac_wordsize 32; // 32-bit words

  ARCH_CTOR(mips1) {

    ac_isa("mips1_isa.ac"); // set ISA
    set_endian("big"); // big endian

  };
};

And here is an excerpt from the ISA description and an instruction behaviour description:

AC_ISA(mips1){

  // declare the format of a group of instructions for decoding
  // this can be thought of as parameter type/count declaration
  ac_format Type_R  = "%op:6 %rs:5 %rt:5 %rd:5 %shamt:5 %func:6";
  
  // ... insert more instruction formats here
  
  // which instructions belong to which format?
  ac_instr<Type_R> add, addu, sub, subu, slt, sltu;
  
  // ... insert more instruction-format matchings here
  
  // assembly equivalent of instructions, for binutils generation
  addi.set_asm("addi %reg, %reg, %exp", rt, rs, imm);
  addi.set_asm("add %reg, %reg, %exp", rt, rs, imm);
  addi.set_decoder(op=0x08);
  // ...
};
...

// Instruction addi behavior method.
void ac_behavior( addi )
{
  dbg_printf("addi r%d, r%d, %d\n", rt, rs, imm & 0xFFFF);
  RB[rt] = RB[rs] + imm;
  dbg_printf("Result = %#x\n", RB[rt]);
  //Test overflow
  if ( ((RB[rs] & 0x80000000) == (imm & 0x80000000)) &&
       ((imm & 0x80000000) != (RB[rt] & 0x80000000)) ) {
    fprintf(stderr, "EXCEPTION(addi): integer overflow.\n"); exit(EXIT_FAILURE);
  }
};
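
The format string for `Type_R` reads as a list of field names and widths, from the most significant bits down. To show what decoding against that format amounts to, here is a hand-written C++ sketch that extracts the same fields from a 32-bit instruction word. ArchC generates its own decoder, so the struct and function names here are purely illustrative:

```cpp
#include <cassert>
#include <cstdint>

// Fields of the R-type format: op:6 rs:5 rt:5 rd:5 shamt:5 func:6 (32 bits total).
struct TypeR {
    uint32_t op, rs, rt, rd, shamt, func;
};

// Extract `width` bits of `word`, starting at bit position `lsb`.
inline uint32_t bits(uint32_t word, unsigned lsb, unsigned width) {
    assert(width > 0 && width < 32 && lsb + width <= 32);
    return (word >> lsb) & ((1u << width) - 1u);
}

// Split an instruction word into its R-type fields.
inline TypeR decodeTypeR(uint32_t word) {
    TypeR f;
    f.op    = bits(word, 26, 6);
    f.rs    = bits(word, 21, 5);
    f.rt    = bits(word, 16, 5);
    f.rd    = bits(word, 11, 5);
    f.shamt = bits(word, 6, 5);
    f.func  = bits(word, 0, 6);
    return f;
}
```

For instance, the MIPS instruction `add $3, $1, $2` is encoded as 0x00221820, which decodes to op = 0, rs = 1, rt = 2, rd = 3, shamt = 0 and func = 0x20.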

Tuesday, February 19, 2013

I'm back - with heterogeneous multicore computing!

Following the (bad) blogger tradition of long periods of silence followed by a "I'm back!" post, here we go :)

In the past 2.5 years I did a lot of things that I really felt I should have blogged about here, mostly as part of the fantastic Erasmus Mundus in Embedded Computing Systems (EMECS) master programme, but ah well. I intend to share the rest of my embedded systems and computer architecture adventures here, and if I manage to get back into the writing mode I might even become retrospective and write about some cool stuff I've seen during the Long Period of Silence :P

To summarize the current situation, I'm in the last semester of the EMECS programme writing my Master's thesis, and recently got a PhD position offer from the Norwegian University of Science and Technology (NTNU) Computer Architecture and Design Group. My thesis and the doctoral research for the following 4 years will be in the fascinating world of heterogeneous multicore architectures, and at the heart of the reason for doing all this still lies the same craving which fueled my Google Summer of Code 2010 project at BeagleBoard with dsp-rpc-posix: to make this amazing hardware usable to its full or near-full potential, more easily, by more developers.

On a more specific level of detail, my MSc thesis is going to be about developing a simulator for a tile-based heterogeneous architecture. I've done a bit of literature research and review of existing simulator work (and most naturally, a brain muddled due to trying to absorb all that information in 1.5 weeks) and for the moment it feels like I'll be basing my work on ArchC / SystemC for a variety of reasons which I'll hopefully go more into soon.

Expect to see blog posts about practical SystemC/ArchC issues, random ramblings about heterogeneous multicore architectures and maybe some cool embedded systems projects soon!