SAMOA-40: Add Kafka stream reader modules to consume data from Kafka framework #32
base: master
Conversation
Apache SAMOA is designed to process streaming data and to develop streaming machine learning algorithms. Currently, the SAMOA framework supports reading stream data from ARFF files only. Thus, when using SAMOA as a streaming machine learning component in real-time use cases, writing and reading data from files is slow and inefficient. A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. The ability to read data directly from Apache Kafka into SAMOA will not only improve performance but also make SAMOA pluggable into many real-time machine learning use cases such as the Internet of Things (IoT).
GOAL: Add code that enables SAMOA to read data from Apache Kafka as a data stream.
The Kafka stream reader supports the following options for streaming:
a) Topic selection - Kafka topic to read data from
b) Partition selection - Kafka partition to read data from
c) Batching - number of data instances read from Kafka in one read request
d) Configuration options - Kafka port number, seed information, and the time delay between two read requests
Components:
KafkaReader - consists of APIs to read data from Kafka
KafkaStream - stream source for SAMOA providing data read from Kafka
Dependencies for Kafka are added to the pom.xml of the samoa-api component.
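For context only, a minimal sketch of how a batch of raw instance strings could be pulled from a Kafka topic with the standard kafka-clients consumer API; the topic name, consumer properties, and batch handling below are illustrative assumptions, not the API introduced by this patch:

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaBatchReadSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // broker host:port (illustrative)
        props.put("group.id", "samoa-reader");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        int batchSize = 100;  // number of instances fetched per read request

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Topic selection; partition-level assignment would be configured here as well.
            consumer.subscribe(Collections.singletonList("samoa-instances"));

            List<String> batch = new ArrayList<>();
            // A real reader would also bound the waiting time instead of looping until full.
            while (batch.size() < batchSize) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    batch.add(record.value()); // one raw instance string per Kafka record
                }
            }
            // The batch would then be converted to SAMOA instances and emitted on a stream.
        }
    }
}
```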
samoa-api/pom.xml (Outdated)
We should factor the versions in a property (as it is done for other dependencies).
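For example, something along these lines (the property name, version, and artifactId are illustrative and may differ from what the patch actually adds):

```xml
<!-- illustrative only: factor the Kafka version into a property in <properties> -->
<properties>
  <kafka.version>0.10.2.1</kafka.version>
</properties>

<!-- the dependency then references the property instead of a hard-coded version -->
<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-clients</artifactId>
  <version>${kafka.version}</version>
</dependency>
```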
Fixed.
@karande thanks for posting this patch.

@gdfm Thank you for your review comments. Kindly review the changes made in the recent commit.

Thanks @karande.

@abifet could you have a look at the patch?
@gdfm, @karande I think that the code has improved a lot and it's almost there. My main concern right now is how to read sparse instances in the sparse instance format (like "{1 0.24, 434 0.34, 500 1}") and not only in the dense instance format (like "0.223, 0.2323, 1"). The code to convert strings to dense and sparse instances is in org.apache.samoa.instances.ArffLoader, and it uses java.io.Reader to get the raw data. One way to do this could be to have KafkaReader extend Reader, and then use ArffLoader to get the instances through the KafkaReader.
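A minimal sketch of that suggestion (the class name KafkaInstanceReader and the Iterator<String> source are hypothetical, not part of this patch): expose the raw Kafka records through a java.io.Reader so the existing ArffLoader parsing handles both dense and sparse instance strings.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Iterator;

public class KafkaInstanceReader extends Reader {

    // Hypothetical source of raw instance lines pulled from Kafka;
    // in the actual patch this would be backed by the Kafka consumer.
    private final Iterator<String> kafkaLines;
    private String current = "";
    private int pos = 0;

    public KafkaInstanceReader(Iterator<String> kafkaLines) {
        this.kafkaLines = kafkaLines;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        // Refill from the next Kafka record once the current line is exhausted.
        while (pos >= current.length()) {
            if (!kafkaLines.hasNext()) {
                return -1; // end of stream
            }
            current = kafkaLines.next() + "\n";
            pos = 0;
        }
        int n = Math.min(len, current.length() - pos);
        current.getChars(pos, pos + n, cbuf, off);
        pos += n;
        return n;
    }

    @Override
    public void close() throws IOException {
        // Closing the underlying Kafka consumer would go here.
    }
}
```

ArffLoader could then be constructed over such a Reader instead of a file-based one, so the dense/sparse parsing logic is not duplicated in the Kafka modules.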
I think the main issue is a separation of concerns:
Force-pushed from f9db1f2 to 1bd1012
Should we merge this? Anything else to be added?

Any update on this?

Is this issue stale?

Yes, I think we can safely close this.