One thing this brought up is the question of how best to design a software toolkit which:
- provides a simple conceptual model for thinking about the data as it goes through various stages of processing and analytics, from data source to answer.
- makes it easy to create our own recipes for data analytics, both simple ones and complex ones with many processing steps.
Data Processing Pipeline
We know we will always have a data source - we've called these the text corpora. We also know we want an answer, an output, perhaps in the form of a chart but sometimes just a list or table.
We also know we need to do something to the data between the data source and the answer. Thinking ahead a little, we know we will want to try all kinds of ideas between the source and the answer, and it would be good not to have to reinvent the wheel every time. That suggests having some kind of framework into which we can easily plug our ideas for processing steps. And we will very likely want to apply more than one of these steps - we saw earlier the application of "lowercase" and "minimum length" steps to our recipes data.
The following shows such a framework - a data pipeline, into which we can easily plug as many processing steps as we like.
This is the framework we want to make for our text mining toolkit.
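To make the idea concrete, here is a minimal sketch of such a pipeline in Python. The names `run_pipeline`, `lowercase` and `minimum_length` are made up for illustration and are not the toolkit's actual API. The sketch assumes the data is a plain list of words and that each step is a function which takes a list and returns a new one.

```python
from typing import Callable, Iterable, List

# A processing step takes a list of words and returns a new list of words.
Step = Callable[[List[str]], List[str]]


def lowercase(words: List[str]) -> List[str]:
    """Hypothetical step: return a lowercased copy of the words."""
    return [w.lower() for w in words]


def minimum_length(n: int) -> Step:
    """Hypothetical step factory: keep only words with at least n characters."""
    def step(words: List[str]) -> List[str]:
        return [w for w in words if len(w) >= n]
    return step


def run_pipeline(source: Iterable[str], steps: List[Step]) -> List[str]:
    """Feed the source through each step in order and return the final output."""
    data = list(source)
    for step in steps:
        data = step(data)
    return data


corpus = ["Mix", "the", "Flour", "and", "Eggs", "in", "a", "bowl"]
result = run_pipeline(corpus, [lowercase, minimum_length(4)])
print(result)  # ['flour', 'eggs', 'bowl']
```

Adding another idea to try is then just a matter of writing one more function and adding it to the list of steps.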
There are alternative designs to think about too. We might have considered holding the data in a Python object and repeatedly mutating it by applying methods which change the data. That could work, but it has the disadvantage that each mutation overwrites the previous state of the data, so earlier results are lost.
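For comparison, the mutating design might look something like the sketch below. `MutableCorpus` is a hypothetical class, written only to illustrate the drawback, and is not something the toolkit defines.

```python
class MutableCorpus:
    """Hypothetical object that holds the data and changes it in place."""

    def __init__(self, words):
        self.words = list(words)

    def lowercase(self):
        # Overwrites the stored words; the capitalised originals are gone.
        self.words = [w.lower() for w in self.words]
        return self

    def minimum_length(self, n):
        # Overwrites the stored words again; the short words are gone.
        self.words = [w for w in self.words if len(w) >= n]
        return self


corpus = MutableCorpus(["Add", "Salt", "and", "Pepper"])
corpus.lowercase().minimum_length(4)
print(corpus.words)  # ['salt', 'pepper']
```

After the two method calls there is no way to get back to the original words, or to the lowercased-but-unfiltered words, without reloading the source.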
Pipeline Networks
It may seem that having a pipeline is less memory efficient, because we're retaining the data that results from each processing step as well as passing it to the next step. A significant advantage, though, is that we can create more complex networks of pipelines, where the output of one step feeds several others. We'll have to see if the overhead defeats this ambition.
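A small sketch of what a network could look like: because the output of one step is retained, it can feed two different branches. The step names (`lowercase`, `word_counts`, `longest_words`) and the tiny word list are assumptions made up for this example.

```python
from collections import Counter


def lowercase(words):
    """Hypothetical step: return a lowercased copy of the words."""
    return [w.lower() for w in words]


def word_counts(words):
    """Hypothetical step: count how often each word occurs."""
    return Counter(words)


def longest_words(words, n):
    """Hypothetical step: the n longest distinct words, ties broken alphabetically."""
    return sorted(set(words), key=lambda w: (-len(w), w))[:n]


source = ["Rice", "rice", "Tomato", "Oil", "tomato", "rice"]

# One shared step ...
lowered = lowercase(source)

# ... whose retained output feeds two independent branches.
counts = word_counts(lowered)
longest = longest_words(lowered, 1)

print(counts.most_common(1))  # [('rice', 3)]
print(longest)                # ['tomato']
```

The cost is visible here too: `source`, `lowered`, `counts` and `longest` all exist in memory at the same time.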
Functional
There is also another benefit, which is that the concept of processing steps taking data input(s) and creating data output(s) is simple, and reflects the functional approach to coding. This has two strong advantages:
- It is possible to parallelise the processing (for example, using GPUs or clusters), because each flow is independent of the others.
- The output of a processing step (function) depends only on its input, and on no other state, which makes pipelines much easier to debug and makes it easier to make claims about their correctness.
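Both points can be illustrated with a short sketch. The function `process_document` and the two tiny documents are invented for the example. A thread pool from Python's standard library stands in for a GPU or cluster here, purely to keep the sketch small; for CPU-heavy work a process pool or a real cluster would be the more likely choice.

```python
from concurrent.futures import ThreadPoolExecutor


def lowercase(words):
    """Hypothetical step: return a lowercased copy of the words."""
    return [w.lower() for w in words]


def process_document(words):
    """Hypothetical flow: lowercase, then keep words of 4 or more characters.

    The result depends only on the argument, never on outside state.
    """
    return [w for w in lowercase(words) if len(w) >= 4]


documents = [
    ["Boil", "the", "Pasta"],
    ["Chop", "an", "Onion"],
]

# Run the flows one after another ...
sequential = [process_document(d) for d in documents]

# ... and run the same flows concurrently; map() keeps the input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = list(pool.map(process_document, documents))

# Same inputs give the same outputs, however the work was scheduled.
assert parallel == sequential
print(parallel)  # [['boil', 'pasta'], ['chop', 'onion']]
```

Because each step is a function of its input alone, a step can also be tested in isolation by calling it with a known input and checking the output.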