
Conversation


@karande karande commented Jul 16, 2015

…framework

Apache SAMOA is designed to process streaming data and develop streaming
machine learning algorithms. Currently, the SAMOA framework supports
reading stream data from ARFF files only. Thus, when using SAMOA as a
streaming machine learning component in real-time use cases, writing and
reading data to and from files is slow and inefficient.

A single Kafka broker can handle hundreds of megabytes of reads and
writes per second from thousands of clients. The ability to read data
directly from Apache Kafka into SAMOA will not only improve performance
but also make SAMOA pluggable into many real-time machine learning use
cases such as the Internet of Things (IoT).

GOAL:
Add code that enables SAMOA to read data from Apache Kafka as a data
stream.
The Kafka stream reader supports the following options for streaming:

a) Topic selection - the Kafka topic to read data from
b) Partition selection - the Kafka partition to read data from
c) Batching - the number of data instances fetched from Kafka in one
read request
d) Configuration options - Kafka port number, seed information, and the
time delay between two read requests

Components:
KafkaReader - consists of APIs to read data from Kafka
KafkaStream - a stream source for SAMOA providing data read from Kafka
Kafka dependencies are added to pom.xml in the samoa-api component.
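To make options (a)-(d) concrete, here is a minimal sketch of what a configuration holder for the reader could look like. This is illustrative only: the class and field names are hypothetical and are not the actual API of this patch.

```java
// Hypothetical sketch of the streaming options (a)-(d) above as a plain
// configuration object; names are illustrative, not the patch's API.
public class KafkaStreamConfig {
    public final String topic;        // (a) topic to read data from
    public final int partition;       // (b) partition within the topic
    public final int batchSize;       // (c) instances fetched per read request
    public final String seedBrokers;  // (d) seed list, e.g. "localhost:9092"
    public final long delayMillis;    // (d) pause between consecutive reads

    public KafkaStreamConfig(String topic, int partition, int batchSize,
                             String seedBrokers, long delayMillis) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.topic = topic;
        this.partition = partition;
        this.batchSize = batchSize;
        this.seedBrokers = seedBrokers;
        this.delayMillis = delayMillis;
    }
}
```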

Contributor


We should factor the versions in a property (as it is done for other dependencies).

Author

Fixed.

@gdfm
Contributor

gdfm commented Jul 21, 2015

@karande thanks for posting this patch.
I have a few doubts about the high-level concepts, but it's a good starting point and we should be able to iterate on the design until we converge.
The most important point is separation of concerns: reading from Kafka should not assume a specific format, and parsing should happen in a separate class that handles the ARFF format (or other formats as we add support for them).
The format type can be passed as a command line parameter (we rely on the user knowing the format of their data).

@karande
Author

karande commented Jul 31, 2015

@gdfm Thank you for your review comments. Kindly review the changes made in the recent commit.
Going forward, I am also planning to add test cases for KafkaStream using in-process Kafka.

@gdfm
Contributor

gdfm commented Aug 7, 2015

Thanks @karande.
Given that the patch requires setting up Kafka, could you share instructions to test?

@gdfm
Contributor

gdfm commented Sep 9, 2015

@abifet could you have a look at the patch?
I think we are getting there, but there are still a few things to fix.
I'd like to get your opinion first.

@abifet
Contributor

abifet commented Oct 14, 2015

@gdfm, @karande I think that the code has improved a lot and it's almost there. My main concern right now is how to read sparse instances in the sparse instance format (like "{1 0.24, 434 0.34, 500 1}") and not only in the dense instance format (like "0.223, .2323, 1").

The code to convert strings to dense and sparse instances is in org.apache.samoa.instances.ArffLoader, and it uses java.io.Reader to get the raw data. One way to do this could be for KafkaReader to extend Reader, and then to use ArffLoader to get the instances through the KafkaReader.
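A minimal sketch of the suggestion above: a Reader subclass that serves characters from consumed Kafka messages, so ArffLoader can parse them unchanged. The Kafka fetch is stubbed here with an in-memory queue; a real implementation would pull the next batch from a Kafka consumer when the buffer empties. The class name and structure are hypothetical, not the patch's actual code.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Queue;

// Hypothetical sketch: a KafkaReader that extends java.io.Reader so a
// parser like ArffLoader can consume it unchanged. The queue stands in
// for records fetched from Kafka (real impl: consumer.poll(...)).
public class KafkaLineReader extends Reader {
    private final Queue<String> messages; // stand-in for fetched Kafka records
    private String current = "";
    private int pos = 0;

    public KafkaLineReader(Queue<String> messages) {
        this.messages = messages;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (pos >= current.length()) {
            String next = messages.poll(); // fetch the next message, if any
            if (next == null) {
                return -1;                 // end of stream
            }
            current = next + "\n";         // one instance per message
            pos = 0;
        }
        int n = Math.min(len, current.length() - pos);
        current.getChars(pos, pos + n, cbuf, off);
        pos += n;
        return n;
    }

    @Override
    public void close() {
        // real impl: close the underlying Kafka consumer
    }
}
```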

@gdfm
Contributor

gdfm commented Nov 9, 2015

I think the main issue is separation of concerns:
One thing is the source of the data; another is the data format.
That is, we could have Avro data coming from Kafka, or ARFF data coming from HDFS, and we should be able to support all of these combinations.
Ideally, the source->format interface is unique and simple (e.g., a byte stream), and it's the responsibility of the format to convert the byte stream into a sequence of instances.
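The source/format split described above could be sketched roughly as the following pair of interfaces. Neither interface exists in SAMOA; the names and the toy CSV format are purely illustrative of the design, under which KafkaSource + ArffFormat and HdfsSource + AvroFormat would become independent combinations.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the source/format separation: a source only
// yields raw characters, and a format turns them into instances.
interface StreamSource {
    Reader open(); // raw data, format-agnostic
}

interface InstanceFormat<T> {
    List<T> parse(Reader raw) throws IOException;
}

// Toy format for illustration: one comma-separated dense instance per line.
class CsvDenseFormat implements InstanceFormat<double[]> {
    @Override
    public List<double[]> parse(Reader raw) throws IOException {
        List<double[]> out = new ArrayList<>();
        BufferedReader br = new BufferedReader(raw);
        for (String line; (line = br.readLine()) != null; ) {
            String[] parts = line.split(",");
            double[] inst = new double[parts.length];
            for (int i = 0; i < parts.length; i++) {
                inst[i] = Double.parseDouble(parts[i].trim());
            }
            out.add(inst);
        }
        return out;
    }
}
```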

@asfgit asfgit force-pushed the master branch 2 times, most recently from f9db1f2 to 1bd1012 on March 16, 2016 06:12
@nicolas-kourtellis

Should we merge this? Anything else to be added?

@karande
Author

karande commented May 26, 2016

Any update on this?

@redsand

redsand commented Dec 22, 2018

Is this issue stale?

@gdfm
Contributor

gdfm commented Jan 21, 2019

Yes, I think we can safely close this.

6 participants